Abstract

An information theoretic framework for unequal error protection is developed in terms of the exponential error bounds. The fundamental difference between bit-wise and message-wise unequal error protection (UEP) is demonstrated for fixed length block codes on DMCs without feedback. The effect of feedback is investigated via variable length block codes. It is shown that feedback results in a significant improvement in both bit-wise and message-wise UEP (except the single message case for missed detection). The distinction between false-alarm and missed-detection formalizations for message-wise UEP is also considered. All results presented are at rates close to capacity.

I. Introduction

The classical theoretical framework for communication [35] assumes that all information is equally important. In this framework, the communication system aims to provide a uniform error protection to all messages: any particular message being mistaken as any other is viewed to be equally costly. With such uniformity assumptions, the reliability of a communication scheme is measured by either the average or the worst case probability of error, over all possible messages to be transmitted. In the information theory literature, a communication scheme is said to be reliable if this error probability can be made small. Communication schemes designed with this framework turn out to be optimal in sending any source over any channel, provided that long enough codes can be employed. This homogeneous view of information motivates the universal interface of "bits" between any source and any channel [35], and is often viewed as Shannon's most significant contribution.

In many communication scenarios, such as wireless networks, interactive systems, and control applications, uniformly good error protection becomes a luxury; providing such protection to the entire information might be wasteful, if not infeasible. Instead, it is more efficient to protect a crucial part of the information better than the rest. For example,

• In a wireless network, control signals like channel state, power control, and scheduling information are often 
more important than the payload data, and should be protected more carefully. Thus even though the final 
objective is delivering the payload data, the physical layer should provide a better protection to such protocol 
information. Similarly for the Internet, packet headers are more important for delivering the packet and need 
better protection to ensure that the actual data gets through. 

• Another example is transmission of a multiple resolution source code. The coarse resolution needs a better 
protection than the fine resolution so that the user at least obtains some crude reconstruction after bad noise 
realizations. 

• Controlling unstable plants over a noisy communication link [33] and compressing unstable sources [34] 
provide more examples where different parts of the information need different reliability. 

In contrast with the classical homogeneous view, these examples demonstrate the heterogeneous nature of information. Furthermore, the practical need for unequal error protection (UEP) due to this heterogeneity, demonstrated in these examples, is the reason why we need to go beyond the conventional content-blind information processing.

Consider a message set $\mathcal{M} = \{1, 2, 3, \ldots, 2^k\}$ for a block code. Note that members of this set, i.e. "messages", can also be represented by length-$k$ strings of information bits, $\mathbf{b} = [b_1, b_2, \ldots, b_k]$. A block code is composed of an encoder which maps the messages, $M \in \mathcal{M}$, into channel inputs and a decoder which maps channel outputs to a decoded message, $\hat{M} \in \mathcal{M}$. An error event for a block code is $\{\hat{M} \neq M\}$. In most information theory texts, when an error occurs, the entire bit sequence $\mathbf{b}$ is rejected. That is, errors in decoding the message and in decoding the information bits are treated similarly. We avoid this, and try to figure out what can be achieved by analyzing the errors of different subsets of bits separately.

In the existing formulations of unequal error protection codes [38] in coding theory, the information bits are partitioned into subsets, and the decoding errors in different subsets of bits are viewed as different kinds of errors. For example, one might want to provide a better protection to one subset of bits by ensuring that errors in these bits are less probable than in the other bits. We call such problems "bit-wise UEP". The previous examples of packet headers, multiple resolution codes, etc. belong to this category of UEP.

However, in some situations, instead of bits one might want to provide a better protection to a subset of messages. For example, one might consider embedding a special message in a normal $k$-bit code, i.e., transmitting one of $2^k + 1$ messages, where the extra message has a special meaning and requires a smaller error probability. Note that the error event for the special message is not associated with an error in any particular bit or set of bits. Instead, it corresponds to a particular bit sequence (i.e. message) being decoded as some other bit sequence. Borrowing from hypothesis testing, we can define two kinds of errors corresponding to a special message.

• Missed-detection of a message $i$ occurs when the transmitted message $M$ is $i$ and the decoded message $\hat{M}$ is some other message $j \neq i$. Consider a special message indicating some system emergency which is too costly to be missed. Clearly, such special messages demand a small missed detection probability. The missed detection probability of a message is simply the conditional error probability after its transmission.

• False-alarm of a message $i$ occurs when the transmitted message $M$ is some other message $j \neq i$ and the decoded message $\hat{M}$ is $i$. Consider the reboot message for a remote-controlled system such as a robot or a satellite, or the "disconnect" message to a cell-phone. Its false-alarm could cause unnecessary shutdowns and other system troubles. Such special messages demand a small false alarm probability.

We call such problems "message-wise UEP". In the conventional framework, every bit is as important as every other bit and every message is as important as every other message. In short, in the conventional framework it is assumed that all the information is "created equal". In such a framework there is no reason to distinguish between bit-wise and message-wise error probabilities, because the message-wise error probability is larger than the bit-wise error probability by an insignificant factor, in terms of exponents. However, in the UEP setting, it is necessary to differentiate between message errors and bit errors. We will see that in many situations, the error probabilities of special bits and special messages behave very differently.

The main contribution of this paper is a set of results, identifying the performance limits and optimal coding 
strategies, for a variety of UEP scenarios. We focus on a few simplified notions of UEP, most with immediate 
practical applications, and try to illustrate the main insights for them. One can imagine using these UEP strategies 
for embedding protocol information within the actual data. By eliminating a separate control channel, this can 
enhance the overall bandwidth and/or energy efficiency. 

For conceptual clarity, this article focuses exclusively on situations where the data rate is essentially equal to the channel capacity. These situations can be motivated by scenarios where data rate is a crucial system resource that cannot be compromised. In these situations, no positive error exponent in the conventional sense can be achieved. That is, if we aim to protect the entire information uniformly well, neither bit-wise nor message-wise error probabilities can decay exponentially fast with increasing code length. We then ask the question: "can we make the error probability of a particular bit, or a particular message, decay exponentially fast with block length?"

When we break away from the conventional framework and start to provide better protection against certain kinds of errors, there is no reason to restrict ourselves by assuming that those errors are erroneous decoding of some particular bits or missed detections or false alarms associated with some particular messages. A general formulation of UEP could be an arbitrary combination of protection demands against some specific kinds of errors. In this general definition of UEP, bit-wise UEP and message-wise UEP are simply two particular ways of specifying which kinds of errors are too costly compared to others.

In the following, we start by specifying the channel model and giving some basic definitions in Section II. Then in Section III we discuss bit-wise UEP and message-wise UEP for block codes without feedback. Theorem 1 shows that for data rates approaching capacity, even a single bit cannot achieve any positive error exponent. Thus in bit-wise UEP, the data rate must back off from capacity for achieving any positive error exponent even for a single bit. On the contrary, in message-wise UEP, positive error exponents can be achieved even at capacity. We first consider the case when there is only one special message and show, in Theorem 2, that the optimal (missed-detection) error exponent for the special message is equal to the red-alert exponent, which is defined in Section III-B. We then consider situations where an exponentially large number of messages are special and each special message demands a positive (missed-detection) error exponent. (This situation has previously been analyzed in [12], and a result closely related to ours has been reported there.) Theorem 3 shows the surprising result that these special messages can achieve the same exponent as if all the other (non-special) messages were absent. In other words, a capacity-achieving code and an error-exponent-optimal code below capacity can coexist without hurting each other. These results also shed some new light on the structure of capacity-achieving codes.

Insights from block codes without feedback become useful in Section IV, where we investigate similar problems for variable length block codes with feedback. Feedback together with variable decoding time creates some fundamental connections between bit-wise UEP and message-wise UEP. Now even for bit-wise UEP, a positive error exponent can be achieved at capacity. Theorem 5 shows that a single special bit can achieve the same exponent as a single special message, namely the red-alert exponent. As the number of special bits increases, the achievable exponent for them decays linearly with their rate, as shown in Theorem 6. Then Theorem 7 generalizes this result to the case when there are multiple levels of specialty: most special, second-most special and so on. It uses a strategy similar to onion-peeling and achieves error exponents which are successively refinable over multiple layers. For the single special message case, however, Theorem 8 shows that feedback does not improve the optimal missed-detection exponent. The case of exponentially many messages is resolved in Theorem 9. Evidently many special messages cannot achieve an exponent higher than that of a single special message, i.e. the red-alert exponent. However, it turns out that the special messages can reach the red-alert exponent at rates below a certain threshold, as if all the other special messages were absent. Furthermore, for rates above the very same threshold, special messages reach the corresponding value of Burnashev's exponent, as if all the ordinary messages were absent.

Section V then addresses message-wise UEP situations where special messages demand a small probability of false-alarms instead of missed-detections. It considers the case of fixed length block codes without feedback as well as variable length block codes with feedback. This discussion of false-alarms was postponed from earlier sections to avoid confusion with the missed-detection results there. Some future directions are discussed briefly in Section VI.

After discussing each theorem, we will provide a brief description of the optimal strategy, but refrain from detailed technical discussions. Proofs can be found in later sections. In Section VII and Section VIII we will present the proofs of the results in Section III on block codes without feedback, and Section IV on variable length block codes with feedback, respectively. Lastly, in Section IX we discuss the proofs for the false-alarm results of Section V. Before going into the presentation of our work, let us give a very brief overview of the previous work on the problem in different fields.

A. Previous Work and Contribution 

The simplest method of unequal error protection is to allocate different channels for different types of data. For example, many wireless systems allocate a separate "control channel", often with short codes with low rate and low spectral efficiency, to transmit control signals with high reliability. The well known Gray code, assigning similar bit strings to nearby constellation points, can be viewed as UEP: even if there is some error in identifying the transmitted symbol, there is a good chance that some of the bits are correctly received. But clearly this approach is far from addressing the problem in any effective way.

The first systematic consideration of the problem in coding theory was within the framework of linear codes. In [24], Masnick and Wolf suggested techniques which protect different parts (bits) of the message against different numbers of channel errors (channel symbol conversions). This framework has been extensively studied over the years in [22], [16], [7], [26], [21], [27], [8] and in many others. Later the issue was addressed within the framework of Low Density Parity Check (LDPC) codes too [39], [29], [30], [32], [31], and [28].

"Priority encoded transmission" (PET) was suggested by Albanese et al. [2] as an alternative model of the problem, with packet erasures. In this approach guarantees are given not in terms of channel errors but in terms of packet erasures. Coding and modulation issues are addressed simultaneously in [10]. For wireless channels, [15] analyzes this problem in terms of diversity-multiplexing trade-offs.

In contrast with the above mentioned work, we pose and address the problem within the information theoretic framework. We work with the error probabilities and refrain from making assumptions about the particular block code used while proving our converse results. This is the main difference between our approach and the prevailing approach within the coding theory community.

In [3], Bassalygo et al. considered error correcting codes whose messages are composed of two groups of bits, each of which requires a different level of protection against channel errors, and provided inner and outer bounds to the achievable performance in terms of Hamming distances and rates. Unlike other works within the coding theory framework, they do not make any assumption about the code. Thus their results can indeed be reinterpreted in our framework as a result for bit-wise UEP on binary symmetric channels.

Some of the UEP problems have already been investigated within the framework of information theory too. Csiszár studied message-wise UEP with many messages in [12]. Moreover, the results in [12] are not restricted to rates close to capacity, as ours are. Message-wise UEP with a single special message was dealt with in [23] by Kudryashov. In [23], a UEP code with a single special message is used as a subcode within a variable delay communication scheme. The scheme proposed in [23] for the single special message case is a key building block in many of the results in Section IV. However, the optimality of the scheme was not proved in [23]. We show that it is indeed optimal.

The main contribution of the current work is the proposed framework for UEP problems within information theory. In addition to the particular results presented on different problems and the contrasts demonstrated between different scenarios, we believe the proof techniques used in subsections^1 VII-A, VII-B.2 and VII-D.2 are novel and they are promising for future work in the field.

II. Channel Model and Notation 

A. DMC's and Block Codes 

We consider a discrete memoryless channel (DMC) $W_{Y|X}$ with input alphabet $\mathcal{X} = \{1, 2, \ldots, |\mathcal{X}|\}$ and output alphabet $\mathcal{Y} = \{1, 2, \ldots, |\mathcal{Y}|\}$. The conditional distribution of the output letter $Y$ when the channel input letter $X$ equals $i \in \mathcal{X}$ is denoted by $W_{Y|X}(\cdot|i)$.

$$\Pr[Y = j \mid X = i] = W_{Y|X}(j|i) \qquad \forall i \in \mathcal{X}, \ \forall j \in \mathcal{Y}.$$

We assume that all the entries of the channel transition matrix are positive, that is, every output letter is reachable 
from every input letter. This assumption is indeed a crucial one. Many of the results we present in this paper 
change when there are zero-probability transitions. 

A length $n$ block code without feedback with message set $\mathcal{M} = \{1, 2, \ldots, |\mathcal{M}|\}$ is composed of two mappings, an encoder mapping and a decoder mapping. The encoder mapping assigns a length $n$ codeword^2

$$x^n(k) = (x_1(k), x_2(k), \ldots, x_n(k)) \qquad \forall k \in \mathcal{M}$$

where $x_t(k)$ denotes the input at time $t$ for message $k$. The decoder mapping, $\hat{M}$, assigns a message to each possible channel output sequence, i.e. $\hat{M} : \mathcal{Y}^n \to \mathcal{M}$.

1 The key idea in subsection VII-B.2 is a generalization of the approach presented in [4].

2 Unless mentioned otherwise, small letters (e.g. $x$) denote a particular value of the corresponding random variable denoted in capital letters (e.g. $X$).

At time zero, the transmitter is given the message $M$, which is chosen from $\mathcal{M}$ according to a uniform distribution. In the following $n$ time units, it sends the corresponding codeword. After observing $Y^n$, the receiver decodes a message. The error probability $P_e$ and rate $R$ of the code are given by

$$P_e \triangleq \Pr\left[\hat{M} \neq M\right] \qquad \text{and} \qquad R \triangleq \frac{\ln |\mathcal{M}|}{n}.$$



B. Different Kinds of Errors 

While discussing message-wise UEP, we consider the conditional error probability for a particular message $i \in \mathcal{M}$,

$$\Pr\left[\hat{M} \neq i \mid M = i\right].$$

Recall that this is the same as the missed detection probability for message $i$.

On the other hand, when we are talking about bit-wise UEP, we consider message sets that are of the form $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2$. In such cases the message $M$ is composed of two submessages, $M = (M_1, M_2)$. The first submessage $M_1$ corresponds to the high-priority bits while the second submessage $M_2$ corresponds to the low-priority bits. The uniform choice of $M$ from $\mathcal{M}$ implies the uniform and independent choice of $M_1$ and $M_2$ from $\mathcal{M}_1$ and $\mathcal{M}_2$ respectively. The error probability of a submessage $M_j$ is given by

$$\Pr\left[\hat{M}_j \neq M_j\right], \qquad j = 1, 2.$$

Note that the overall message $M$ is decoded incorrectly when either $M_1$ or $M_2$ or both are decoded incorrectly. The goal of bit-wise UEP is to achieve the best possible $\Pr[\hat{M}_1 \neq M_1]$ while ensuring a reasonably small $P_e = \Pr[\hat{M} \neq M]$.
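As a small illustration of these definitions (not part of the original development), the following Python sketch computes the two submessage error probabilities and the overall error probability from a table of decoding probabilities. The table `transition` and the function name are hypothetical; the only assumption taken from the text is the uniform choice of the message.

```python
from itertools import product

def submessage_error_probabilities(transition, messages1, messages2):
    """Return (Pr[M1_hat != M1], Pr[M2_hat != M2], Pr[M_hat != M]).

    transition[(m1, m2)][(h1, h2)] is a hypothetical table holding
    Pr[M_hat = (h1, h2) | M = (m1, m2)]; messages are chosen uniformly.
    """
    messages = list(product(messages1, messages2))
    prior = 1.0 / len(messages)
    err1 = err2 = err = 0.0
    for m in messages:
        for h, prob in transition[m].items():
            weight = prior * prob
            if h[0] != m[0]:      # high-priority submessage decoded wrongly
                err1 += weight
            if h[1] != m[1]:      # low-priority submessage decoded wrongly
                err2 += weight
            if h != m:            # overall message decoded wrongly
                err += weight
    return err1, err2, err
```

Since the overall error event is the union of the two submessage error events, the returned values always satisfy max(err1, err2) <= err <= err1 + err2.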



C. Reliable Code Sequences 

We focus on systems where reliable communication is achieved in order to find exponentially tight bounds for error probabilities of special parts of information. We use the notion of code sequences to simplify our discussion. A sequence of codes $Q$ indexed by their block lengths is called reliable if

$$\lim_{n \to \infty} P_e^{(n)} = 0.$$

For any reliable code sequence $Q$, the rate $R_Q$ is given by

$$R_Q \triangleq \liminf_{n \to \infty} \frac{\ln |\mathcal{M}^{(n)}|}{n}.$$

The (conventional) error exponent of a reliable sequence is then

$$E_Q \triangleq \liminf_{n \to \infty} \frac{-\ln P_e^{(n)}}{n}.$$

Thus the number of messages in $Q$ is^3 $|\mathcal{M}^{(n)}| \doteq e^{n R_Q}$ and their average error probability decays like $e^{-n E_Q}$ with block length. Now we can define the error exponent $E(R)$ in the conventional sense, which is equivalent to the ones given in [20], [36], [13], [17], [25].

3 The $\doteq$ sign denotes equality in the exponential sense. For a sequence $a^{(n)}$,

$$a^{(n)} \doteq e^{nF} \iff F = \liminf_{n \to \infty} \frac{\ln a^{(n)}}{n}.$$

Definition 1: For any $R < C$ the error exponent $E(R)$ is defined as

$$E(R) \triangleq \sup_{Q : R_Q \geq R} E_Q.$$
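To make the finite-$n$ quantities behind the rate and the error exponent of a code sequence concrete, the following Python sketch evaluates $\ln|\mathcal{M}|/n$ and $-\ln P_e / n$ for a toy sequence: the two-message repetition code with majority decoding on a binary symmetric channel. This example is our own assumption and is not taken from the paper. Its rate tends to zero, so it is not a capacity-achieving sequence; it only illustrates how the normalized logarithms behave as the block length grows.

```python
import math

def repetition_error_probability(n: int, p: float) -> float:
    """Exact error probability of majority decoding for the n-fold repetition
    code with two equiprobable messages on a BSC(p); n is assumed odd."""
    # An error occurs when more than half of the n channel uses are flipped.
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

def empirical_rate(num_messages: int, n: int) -> float:
    """ln|M| / n, the rate in nats per channel use at block length n."""
    return math.log(num_messages) / n

def empirical_exponent(error_probability: float, n: int) -> float:
    """-ln(P_e) / n, the finite-n quantity whose liminf is the error exponent."""
    return -math.log(error_probability) / n

if __name__ == "__main__":
    p = 0.1
    for n in (11, 101, 501):
        pe = repetition_error_probability(n, p)
        print(n, empirical_rate(2, n), empirical_exponent(pe, n))
```

For this toy sequence the Chernoff bound gives the limiting exponent $-\ln\big(2\sqrt{p(1-p)}\big)$, which the printed values approach from above as $n$ grows.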

As mentioned previously, we are interested in UEP when operating at capacity. We already know, [36], that $E(C) = 0$, i.e. the overall error probability cannot decay exponentially at capacity. In the following sections, we show how certain parts of information can still achieve a positive exponent at capacity. In doing that, we are focusing only on the reliable sequences whose rates are equal to $C$. We call such reliable code sequences capacity-achieving sequences.

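Throughout the text we denote the Kullback-Leibler (KL) divergence between two distributions $\alpha_X(\cdot)$ and $\beta_X(\cdot)$ as $D(\alpha_X(\cdot) \| \beta_X(\cdot))$.

$$D(\alpha_X(\cdot) \| \beta_X(\cdot)) \triangleq \sum_{i \in \mathcal{X}} \alpha_X(i) \ln \frac{\alpha_X(i)}{\beta_X(i)}$$

Similarly, the conditional KL divergence between $V_{Y|X}(\cdot|\cdot)$ and $W_{Y|X}(\cdot|\cdot)$ under $P_X(\cdot)$ is given by

$$D\left(V_{Y|X}(\cdot|X) \,\|\, W_{Y|X}(\cdot|X) \mid P_X\right) \triangleq \sum_{i \in \mathcal{X}} P_X(i) \sum_{j \in \mathcal{Y}} V_{Y|X}(j|i) \ln \frac{V_{Y|X}(j|i)}{W_{Y|X}(j|i)}$$

The output distribution that achieves the capacity is denoted by $P_Y^*$ and a corresponding input distribution is denoted by $P_X^*$.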
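The two divergences above translate directly into code. The following is a minimal Python sketch, assuming distributions are given as lists of probabilities and channels as lists of rows (row $i$ being the output distribution for input letter $i$); the function names are our own. It relies on the standing assumption that all channel transition probabilities are positive, so the second argument is never zero where the first is positive.

```python
import math
from typing import Sequence

def kl_divergence(alpha: Sequence[float], beta: Sequence[float]) -> float:
    """D(alpha || beta) in nats, with the convention 0 * ln(0 / b) = 0."""
    return sum(a * math.log(a / b) for a, b in zip(alpha, beta) if a > 0)

def conditional_kl_divergence(V: Sequence[Sequence[float]],
                              W: Sequence[Sequence[float]],
                              P: Sequence[float]) -> float:
    """D(V || W | P): the row-wise divergences D(V(.|i) || W(.|i))
    averaged over the input distribution P."""
    return sum(P[i] * kl_divergence(V[i], W[i])
               for i in range(len(P)) if P[i] > 0)
```

Natural logarithms are used throughout, matching the use of $e^{nR}$ for the number of messages.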

III. UEP at Capacity: Block Codes without Feedback 

A. Special bit 

We first address the situation where one particular bit (say the first) out of the total $\log_2 |\mathcal{M}|$ bits is a special bit: it needs a much better error protection than the overall information. The error probability of the special bit is required to decay as fast as possible while ensuring reliable communication at capacity for the overall code. The single special bit is denoted by $M_1$ where $\mathcal{M}_1 = \{0, 1\}$, and the overall message $M$ is of the form $M = (M_1, M_2)$ where $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2$. The optimal error exponent $E_b$ for the special bit is then defined as follows.^4

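Definition 2: For a capacity-achieving sequence $Q$ with message sets $\mathcal{M}^{(n)} = \mathcal{M}_1 \times \mathcal{M}_2^{(n)}$ where $\mathcal{M}_1 = \{0, 1\}$, the special bit error exponent is defined as

$$E_{b,Q} \triangleq \liminf_{n \to \infty} \frac{-\ln \Pr^{(n)}\left[\hat{M}_1 \neq M_1\right]}{n}.$$

Then $E_b$ is defined as $E_b \triangleq \sup_Q E_{b,Q}$.

Thus if $\Pr^{(n)}[\hat{M}_1 \neq M_1] \doteq \exp(-n E_{b,Q})$ for a reliable sequence $Q$, then $E_b$ is the supremum of $E_{b,Q}$ over all capacity-achieving $Q$'s.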

Since $E(C) = 0$, it is clear that the entire information cannot achieve any positive error exponent at capacity. However, it is not clear whether a single special bit can steal a positive error exponent $E_b$ at capacity.

Theorem 1:

$$E_b = 0.$$



This implies that, if we want the error probability of the messages to vanish with increasing block length and 
the error probability of at least one of the bits to decay with a positive exponent with block length, the rate of 
the code sequence should be strictly smaller than the capacity. 

The proof of the theorem is heavy in calculations, but the main idea behind it is the "blowing up lemma" [13]. Conventionally, this lemma is only used for strong converses for various capacity theorems. It is also worth mentioning that the conventional converse techniques like Fano's inequality are not sufficient to prove this result.

4 Appendix A discusses a different but equivalent type of definition and shows why it is equivalent to this one. These two types of 
definitions are equivalent for all the UEP exponents discussed in this paper. 



Fig. 1. Splitting the output space into 2 distant enough clusters. 



Intuitive interpretation: Let the shaded balls in Fig. 1 denote the minimal decoding regions of the messages. These decoding regions ensure reliable communication; they are essentially the typical noise-balls ([11]) around codewords. The decoding regions on the left of the thick line correspond to $M_1 = 1$ and those on the right correspond to $M_1 = 0$. Each of these halves includes half of the decoding regions. Intuitively, the blowing up lemma implies that if we try to add slight extra thickness to the left cluster in Fig. 1, it blows up to occupy almost all the output space. This strange phenomenon in high dimensional spaces leaves no room for the right cluster to fit. Infeasibility of adding even slight extra thickness implies a zero error exponent for the special bit.



B. Special message 

Now consider situations where one particular message (say $M = 1$) out of the $\doteq e^{nC}$ total messages is a special message: it needs a superior error protection. The missed detection probability for this 'emergency' message needs to be minimized. The best missed detection exponent $E_{md}$ is defined as follows.^5

Definition 3: For a capacity-achieving sequence $Q$, the missed detection exponent is defined as

$$E_{md,Q} \triangleq \liminf_{n \to \infty} \frac{-\ln \Pr^{(n)}\left[\hat{M} \neq 1 \mid M = 1\right]}{n}.$$

Then $E_{md}$ is defined as $E_{md} \triangleq \sup_Q E_{md,Q}$.

Compare this with the situation where we aim to protect all the messages uniformly well. If all the messages demand an equally good missed detection exponent, then no positive exponent is achievable at capacity. This follows from the earlier discussion about $E(C) = 0$. The theorem below shows the improvement in this exponent if we only demand it for a single message instead of all.

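Definition 4: The parameter $\tilde{C}$ is defined^6 as the red-alert exponent of a channel.

$$\tilde{C} \triangleq \max_{i \in \mathcal{X}} D\left(P_Y^*(\cdot) \,\|\, W_{Y|X}(\cdot|i)\right)$$

We will denote the input letter achieving the above maximum by $x_r$.

Theorem 2:

$$E_{md} = \tilde{C}.$$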

5 Note that the definition obtained by replacing $\Pr^{(n)}[\hat{M} \neq 1 \mid M = 1]$ by $\min_j \Pr^{(n)}[\hat{M} \neq j \mid M = j]$ is equivalent to the one given above, since we are taking the supremum over $Q$ anyway. In short, the message $j$ with the smallest conditional error probability could always be relabeled as message 1.

6 Authors would like to thank Krishnan Eswaran of UC Berkeley for suggesting this name.

Recall that the Karush-Kuhn-Tucker (KKT) conditions for achieving capacity imply the following expression for capacity, [20, Theorem 4.5.1].

$$C = \max_{i \in \mathcal{X}} D\left(W_{Y|X}(\cdot|i) \,\|\, P_Y^*(\cdot)\right)$$

Note that simply switching the arguments of the KL divergence within the maximization for $C$ gives us the expression for $\tilde{C}$. The capacity $C$ represents the best possible data rate over a channel, whereas the red-alert exponent $\tilde{C}$ represents the best possible protection achievable for a message at capacity.

It is worth mentioning here the "very noisy" channel in [20]. In this formulation [6], the KL divergence is symmetric, which implies $D(P_Y^*(\cdot) \| W_{Y|X}(\cdot|i)) \approx D(W_{Y|X}(\cdot|i) \| P_Y^*(\cdot))$. Hence the red-alert exponent and capacity become roughly equal. For a symmetric channel like the BSC, all inputs can be used as $x_r$. Since $P_Y^*$ is the uniform distribution for these channels, $\tilde{C} = D(P_Y^*(\cdot) \| W_{Y|X}(\cdot|i))$ for any input letter $i$. This also happens to be the sphere-packing exponent $E_{sp}(0)$ of this channel [36] at rate 0.
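As a numerical check of the two expressions (a sketch of our own, not taken from the paper), the following Python code evaluates the capacity and the red-alert exponent of a binary symmetric channel with crossover probability `p`, using the uniform output distribution mentioned above. The function names are assumptions made for this illustration.

```python
import math

def kl_divergence(alpha, beta):
    """D(alpha || beta) in nats; terms with alpha(i) = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(alpha, beta) if a > 0)

def bsc_capacity_and_red_alert(p):
    """Return (capacity, red-alert exponent) in nats for a BSC(p), 0 < p < 1."""
    channel = [[1 - p, p], [p, 1 - p]]   # row i is the output distribution for input i
    output = [0.5, 0.5]                  # capacity-achieving output distribution of a BSC
    # Capacity: channel row first, output distribution second.
    capacity = max(kl_divergence(row, output) for row in channel)
    # Red-alert exponent: the same maximization with the arguments switched.
    red_alert = max(kl_divergence(output, row) for row in channel)
    return capacity, red_alert

if __name__ == "__main__":
    for p in (0.1, 0.3, 0.49):
        print(p, bsc_capacity_and_red_alert(p))
```

For crossover probabilities close to 1/2 the two returned values become nearly equal, in line with the very noisy channel remark; for a less noisy BSC the red-alert exponent is strictly larger than the capacity.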

Optimal strategy: Codewords of a capacity-achieving code are used for the ordinary messages. The codeword for the special message is a repetition sequence of the input letter $x_r$. For all the output sequences the special message is decoded, except for the output sequences with empirical distribution (type) approximately equal to $P_Y^*$. For the output sequences with empirical distribution approximately $P_Y^*$, the decoding scheme of the original capacity-achieving code is used.
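The decoding rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "approximately equal" is modeled here by a maximum absolute deviation `tolerance` between the empirical type and the capacity-achieving output distribution, and `ordinary_decoder` stands for the decoder of the underlying capacity-achieving code, which is not specified here. Both are hypothetical names introduced for this sketch.

```python
from collections import Counter
from typing import Callable, Sequence

SPECIAL_MESSAGE = 1  # label of the special message, as in the text

def empirical_type(y: Sequence[int], output_alphabet: Sequence[int]) -> list:
    """Empirical distribution (type) of the output sequence y."""
    counts = Counter(y)
    return [counts[letter] / len(y) for letter in output_alphabet]

def decode_with_special_message(y: Sequence[int],
                                output_alphabet: Sequence[int],
                                p_y: Sequence[float],
                                tolerance: float,
                                ordinary_decoder: Callable[[Sequence[int]], int]) -> int:
    """Decode the special message unless the type of y is close to p_y."""
    q = empirical_type(y, output_alphabet)
    # Assumed closeness test: largest coordinate-wise deviation from p_y.
    is_typical = max(abs(qi - pi) for qi, pi in zip(q, p_y)) <= tolerance
    if is_typical:
        # Output looks like that of an ordinary codeword: use the original decoder.
        return ordinary_decoder(y)
    return SPECIAL_MESSAGE
```

The design choice visible here is that the special message owns every output type except a small neighborhood of the capacity-achieving output distribution, which is what makes its decoding region so large.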

Indeed Kudryashov [23] had already suggested the encoding scheme described above, as a subcode for his non- 
block variable delay coding scheme. However discussion in [23] does not make any claims about the optimality 
of this encoding scheme. 

Intuitive interpretation: Having a large missed detection exponent for the special message corresponds to having a large decoding region for the special message. This ensures that when $M = 1$, i.e. when the special message is transmitted, the probability of $\hat{M} \neq 1$ is exponentially small. In a sense $E_{md}$ indicates how large the decoding region of the special message could be made, while still filling $\doteq e^{nC}$ typical noise balls in the remaining space. The red region in Fig. 2 denotes such a large region. Note that the actual decoding region of the special message is much larger than this illustration, because it consists of all output types except the ones close to $P_Y^*$, whereas the ordinary decoding regions only contain the output types close to $P_Y^*$.




Fig. 2. Avoiding missed-detection 

The utility of this result is two-fold: first, the optimality of such a simple scheme was not obvious before; second, as we will see later, protecting a single special message is a key building block for many other problems when feedback is available.

C. Many special messages

Now consider the case when, instead of a single special message, exponentially many of the total $\doteq e^{nC}$ messages are special. Let $\mathcal{M}_s^{(n)} \subset \mathcal{M}^{(n)}$ denote this set of special messages,

$$\mathcal{M}_s^{(n)} = \{1, 2, \ldots, \lceil e^{nr} \rceil\}.$$

The best missed detection exponent, achievable simultaneously for all of the special messages, is denoted by $E_{md}(r)$.

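Definition 5: For a capacity-achieving sequence $Q$, the missed detection exponent achieved on a sequence of subsets $\mathcal{M}_s$ is defined as

$$E_{md,Q,\mathcal{M}_s} \triangleq \liminf_{n \to \infty} \frac{-\ln \max_{i \in \mathcal{M}_s^{(n)}} \Pr^{(n)}\left[\hat{M} \neq i \mid M = i\right]}{n}.$$

Then for a given $r < C$, $E_{md}(r)$ is defined as $E_{md}(r) \triangleq \sup_{Q, \mathcal{M}_s} E_{md,Q,\mathcal{M}_s}$, where the maximization is over $\mathcal{M}_s$'s such that $\liminf_{n \to \infty} \frac{\ln |\mathcal{M}_s^{(n)}|}{n} = r$.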

This message-wise UEP problem has already been investigated by Csiszár in his paper on joint source-channel coding [12]. His analysis allows for multiple sets of special messages, each with its own rate, and an overall rate that can be smaller than the capacity.^7

Essentially, $E_{md}(r)$ is the best value for which the missed detection probability of every special message is $\doteq \exp(-n E_{md}(r))$ or smaller. Note that if the only messages in the code are these $\lceil e^{nr} \rceil$ special messages (instead of $|\mathcal{M}^{(n)}| \doteq e^{nC}$ total messages), their best missed detection exponent equals the classical error exponent $E(r)$ discussed earlier.

Theorem 3:

$$E_{md}(r) = E(r) \qquad \forall r \in [0, C).$$

Thus we can communicate reliably at capacity and still protect the special messages as if we were only communicating the special messages. Note that the classical error exponent $E(r)$ is yet unknown for rates below the critical rate (except zero rate). Nonetheless, this theorem says that whatever $E(r)$ can be achieved for $\lceil e^{nr} \rceil$ messages when they are by themselves in the codebook can still be achieved when there are $\doteq e^{nC}$ additional ordinary messages requiring reliable communication.

Optimal strategy: Start with an optimal codebook for $\lceil e^{nr} \rceil$ messages which achieves the error exponent $E(r)$. These codewords are used for the special messages. Now the ordinary codewords are added using random coding. The ordinary codewords which land close to a special codeword may be discarded without essentially any effect on the rate of communication.

The decoder uses a two-stage decoding rule. In the first stage it decides whether or not a special message was sent: if the received sequence is close to one or more of the special codewords, the receiver decides that a special message was sent; otherwise it decides that an ordinary message was sent. In the second stage, the receiver employs ML decoding either among the ordinary messages or among the special messages, depending on its decision in the first stage.
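The two-stage rule can be sketched as follows in Python for the special case of a binary symmetric channel with crossover probability below 1/2, where ML decoding reduces to minimum Hamming distance decoding. The closeness test of the first stage is modeled by a Hamming ball of a given `radius`; this choice, the dictionary representation of the codebooks, and all names below are assumptions made for illustration rather than the construction used in the proofs.

```python
from typing import Dict, Sequence

def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def two_stage_decode(y: Sequence[int],
                     special_codebook: Dict[int, Sequence[int]],
                     ordinary_codebook: Dict[int, Sequence[int]],
                     radius: int) -> int:
    """Two-stage decoding; codebooks map message labels to codewords."""
    # Stage 1: is y within the assumed radius of at least one special codeword?
    near_special = any(hamming_distance(y, codeword) <= radius
                       for codeword in special_codebook.values())
    candidates = special_codebook if near_special else ordinary_codebook
    # Stage 2: minimum-distance (ML on a BSC with p < 1/2) decoding
    # within the class chosen in the first stage.
    return min(candidates, key=lambda m: hamming_distance(y, candidates[m]))
```

A usage example with toy codebooks: with special codewords `000000` and `111111`, ordinary codewords `000111` and `111000`, and `radius = 1`, the output `000001` is decoded as the first special message while `000111` is decoded as an ordinary message.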

The overall missed detection exponent $E_{md}(r)$ is bottle-necked by the second stage errors. This is because the first stage error exponent is essentially the sphere-packing exponent $E_{sp}(r)$, which is never smaller than the second stage error exponent $E(r)$.



Intuitive interpretation: This means that we can start with a code of $\lceil e^{nr} \rceil$ messages, where the decoding regions are large enough to provide a missed detection exponent of $E(r)$. Consider the balls around each codeword with sphere-packing radius (see Fig. 3(a)). For each message, the probability of going outside its ball decays exponentially with the sphere-packing exponent. Although these $\lceil e^{nr} \rceil$ balls fill up most of the output space, there are still some cavities left between them. These small cavities can still accommodate $\doteq e^{nC}$ typical noise balls for the ordinary messages (see Fig. 3(b)), which are much smaller than the original $\lceil e^{nr} \rceil$ balls. This is analogous to filling sand particles in a box full of large boulders. This theorem is like saying that the number of sand particles remains unaffected (in terms of the exponent) in spite of the large boulders.

7 Authors would like to thank Pulkit Grover of UC Berkeley for pointing out this closely related work, [12].




Fig. 3. "There is always room for capacity": (a) exponent-optimal code; (b) achieving capacity.



D. Allowing erasures 

In some situations, a decoder may be allowed to declare an erasure when it is not sure about the transmitted
message. These erasure events are not counted as errors and are usually followed by a retransmission using a
decision feedback protocol like Hybrid-ARQ. This subsection extends the earlier result for $E_{md}(r)$ to the cases
when such erasures are allowed.

In decoding with erasures, in addition to the message set $\mathcal{M}$, the decoder can map the received sequence $Y^n$
to a virtual message called "erasure". Let $P_{erasure}$ denote the average erasure probability of a code:

$$P_{erasure} = \Pr[\hat{M} = \text{erasure}].$$



Previously, when there were no erasures, errors were not detected. In errors-and-erasures decoding, erasures are
detected errors, the remaining errors are undetected errors, and $P_e$ denotes the undetected error probability. Thus
the average and conditional (undetected) error probabilities are given by

$$P_e = \Pr[\hat{M} \neq M,\ \hat{M} \neq \text{erasure}] \quad \text{and} \quad P_e(i) = \Pr[\hat{M} \neq M,\ \hat{M} \neq \text{erasure} \mid M = i].$$



An infinite sequence $Q$ of block codes with errors-and-erasures decoding is reliable if both its average error
probability and its average erasure probability vanish with $n$:

$$\lim_{n\to\infty} P_e^{(n)} = 0 \quad \text{and} \quad \lim_{n\to\infty} P_{erasure}^{(n)} = 0.$$







If the erasure probability is small, then the average number of retransmissions needed is also small. Hence this
condition of vanishingly small $P_{erasure}^{(n)}$ ensures that the effective data rate of a decision feedback protocol
remains unchanged in spite of retransmissions. We again restrict ourselves to reliable sequences whose rate equals
$C$.
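The effect of a small erasure probability on the number of transmissions can be made concrete with a short calculation. The sketch below assumes that each attempt is erased independently with the same probability (a fresh attempt over a memoryless channel), so the number of attempts is geometric; the function names are illustrative only.

```python
def expected_transmissions(p_erasure):
    """Mean number of block transmissions when each attempt is erased
    independently with probability p_erasure (geometric distribution)."""
    if not 0.0 <= p_erasure < 1.0:
        raise ValueError("p_erasure must lie in [0, 1)")
    return 1.0 / (1.0 - p_erasure)


def effective_rate(rate, p_erasure):
    """Rate per channel use once retransmissions are accounted for:
    the nominal rate divided by the expected number of attempts."""
    return rate / expected_transmissions(p_erasure)
```

Under this assumption the effective rate is the nominal rate multiplied by $(1 - P_{erasure})$, which tends to the nominal rate as the erasure probability vanishes.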

We could redefine all previous exponents for decision-feedback (df) scenarios, i.e. for reliable codes with erasure
decoding. But the resulting exponents do not change with the provision of erasures with vanishing probability for
single bit or single message problems, i.e. decision feedback protocols such as Hybrid-ARQ do not improve
$E_b$ or $E_{md}$. Thus we only discuss the decision feedback version of $E_{md}(r)$.
Definition 6: For a capacity-achieving sequence with erasures, $Q$, the missed detection exponent achieved on a
sequence of subsets $\mathcal{M}_s$ is defined as

$$E^{df}_{md,Q,\mathcal{M}_s} \triangleq \liminf_{n\to\infty} \frac{-\ln \max_{i \in \mathcal{M}_s^{(n)}} \Pr^{(n)}[\hat{M} \neq i,\ \hat{M} \neq \text{erasure} \mid M = i]}{n}.$$

Then for a given $r < C$, $E^{df}_{md}(r)$ is defined as $E^{df}_{md}(r) \triangleq \sup_{Q, \mathcal{M}_s} E^{df}_{md,Q,\mathcal{M}_s}$, where the maximization is over
$\mathcal{M}_s$'s such that $\liminf_{n\to\infty} \frac{\ln |\mathcal{M}_s^{(n)}|}{n} = r$.

The next theorem shows that allowing erasures increases the missed-detection exponent for $r$ below the critical rate, on
symmetric channels.

Theorem 4: For symmetric channels

$$E^{df}_{md}(r) \geq E_{sp}(r) \qquad \forall r \in [0, C).$$



The coding strategy is similar to the no-erasure case. We first start with an erasure code for $\lceil e^{nr} \rceil$ messages like
the one in [18], and then add randomly generated ordinary codewords to it. Again a two-stage decoding is performed,
where the first stage decides between the set of ordinary codewords and the set of special codewords using a
threshold distance. If this first stage chooses special codewords, the second stage applies the decoding rule in
[18] amongst special codewords. Otherwise, the second stage uses ML decoding among ordinary codewords.

The overall missed detection exponent $E^{df}_{md}(r)$ is bottlenecked by the first-stage errors, because the
first-stage error exponent $E_{sp}(r)$ is smaller than the second-stage error exponent $E_{sp}(r) + C - r$. This is in contrast
with the case without erasures.



IV. UEP at Capacity: Variable Length Block Codes with Feedback 

In the last section, we analyzed bit wise and message wise UEP problems for fixed length block codes 
(without feedback) operating at capacity. In this section, we will revisit the same problems for variable length 
block codes with perfect feedback, operating at capacity. Before going into the discussion of the problems, let 
us recall variable length block codes with feedback briefly. 

A variable length block code with feedback is composed of a coding algorithm and a decoding rule. The decoding
rule determines the decoding time and the message that is decoded then. The possible observations of the receiver
can be seen as leaves of a $|\mathcal{Y}|$-ary tree, as in [4]. In this tree, the nodes at length 1 from the root denote the $|\mathcal{Y}|$
possible outputs at time $t = 1$. All non-leaf nodes among these split into a further $|\mathcal{Y}|$ branches at the next time
$t = 2$, and the branching of the non-leaf nodes continues like this ever after. Each node of depth $t$ in this tree
corresponds to a particular sequence $y^t$, i.e. a history of outputs until time $t$. The parent of node $y^t$ is its prefix
$y^{t-1}$. The leaves of this tree form a prefix-free source code, because the decision to stop for decoding has to be a causal
event. In other words, the event $\{\tau = t\}$ should be measurable in the $\sigma$-field generated by $Y^t$. In addition we
have $\Pr[\tau < \infty] = 1$; thus the decoding time $\tau$ is a Markov stopping time with respect to the receiver's observation. The
coding algorithm, on the other hand, assigns an input letter $X_{t+1}(y^t; i)$ to each message $i \in \mathcal{M}$ at each non-leaf
node $y^t$ of this tree. The encoder stops transmission of a message when a leaf is reached, i.e. when the decoding
is complete.
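The tree picture can be illustrated with a small sketch. It assumes the set of leaves is finite and given explicitly as tuples of output symbols, which is only a toy model of the tree described above; `is_prefix_free` and `decoding_time` are hypothetical helper names.

```python
def is_prefix_free(leaves):
    """True if no leaf (a tuple of channel outputs) is a proper prefix of another leaf."""
    leaf_set = set(leaves)
    for leaf in leaf_set:
        for t in range(1, len(leaf)):
            if leaf[:t] in leaf_set:
                return False
    return True


def decoding_time(outputs, leaves):
    """First time t at which the observed history y^t is a leaf.

    The decision to stop at time t depends only on outputs[:t], so it is causal.
    Returns None if no leaf has been reached within the given outputs.
    """
    leaf_set = set(leaves)
    for t in range(1, len(outputs) + 1):
        if tuple(outputs[:t]) in leaf_set:
            return t
    return None
```

Because the stopping decision at time $t$ looks only at the first $t$ outputs, the leaves reached this way cannot be prefixes of one another.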

The codes we consider are block codes in the sense that transmission of each message (packet) starts only after
the transmission of the previous one ends. The error probability and rate of the code are simply given by

$$P_e = \Pr[\hat{M} \neq M] \quad \text{and} \quad R = \frac{\ln |\mathcal{M}|}{E[\tau]}.$$

A more thorough discussion of variable length block codes with feedback can be found in [9] and [4]. 
The earlier discussion in Section II-B about different kinds of errors is still valid as is, but we need to slightly
modify our discussion about reliable sequences. A reliable sequence of variable length block codes with
feedback, $Q$, is any countably infinite collection of codes indexed by integers, such that

$$\lim_{k\to\infty} P_e^{(k)} = 0.$$

In the rate and exponent definitions for reliable sequences, we replace the block length $n$ by the expected decoding
time $E[\tau]$. Then a capacity-achieving sequence with feedback is a reliable sequence of variable length block
codes with feedback whose rate is $C$.

It is worth noting the importance of our assumption that all the entries of the transition probability matrix
$W_{Y|X}$ are positive. For any channel with a $W_{Y|X}$ which has one or more zero probability transitions, it is possible
to have error-free codes operating at capacity [9]. Thus all the exponents discussed below are infinite for DMCs
with one or more zero probability transitions.






A. Special bit 

Let us consider a capacity-achieving sequence $Q$ whose message sets are of the form $\mathcal{M}^{(k)} = \mathcal{M}_1 \times \mathcal{M}_2^{(k)}$,
where $\mathcal{M}_1 = \{0, 1\}$. Then the error exponent of $M_1$, i.e., the initial bit, is defined as follows.

Definition 7: For a capacity-achieving sequence with feedback, $Q$, with message sets $\mathcal{M}^{(k)}$ of the form
$\mathcal{M}^{(k)} = \mathcal{M}_1 \times \mathcal{M}_2^{(k)}$ where $\mathcal{M}_1 = \{0, 1\}$, the special bit error exponent is defined as

$$E^f_{b,Q} \triangleq \liminf_{k\to\infty} \frac{-\ln \Pr^{(k)}[\hat{M}_1 \neq M_1]}{E[\tau^{(k)}]}.$$

Then $E^f_b$ is defined as $E^f_b \triangleq \sup_Q E^f_{b,Q}$.

Theorem 5:

$$E^f_b = C.$$



Recall that without feedback, even a single bit could not achieve any positive error exponent at capacity (Theorem
1). But feedback together with variable decoding time connects message-wise UEP and bit-wise UEP and
results in a positive exponent for bit-wise UEP. The strategy described below shows how schemes for protecting a
special message can be used to protect a special bit.

Optimal strategy: We use a length $(k + \sqrt{k})$ fixed length block code with errors-and-erasures decoding as a
building block for our code. The transmitter first transmits $M_1$ using a short repetition code of length $\sqrt{k}$. If the
tentative decision about $M_1$, $\hat{M}_1$, is correct after this repetition code, the transmitter sends $M_2$ with a length $k$ capacity-achieving
code. If $\hat{M}_1$ is incorrect after the repetition code, the transmitter sends the symbol $x_r$ for $k$ time units, where
$x_r$ is the input letter $i$ maximizing $D(P_Y(\cdot) \| W_{Y|X}(\cdot|i))$. If the output sequence in the second phase, $Y_{\sqrt{k}+1}^{\sqrt{k}+k}$,
is not a typical sequence of $P_Y$, an erasure is declared for the block, and the same message is retransmitted by
repeating the same strategy afresh. Otherwise the receiver uses an ML decoder to choose $\hat{M}_2$, and $\hat{M} = (\hat{M}_1, \hat{M}_2)$.

The erasure probability is vanishingly small; as a result, the undetected error probability of $M_1$ in the fixed length
erasure code is approximately equal to the error probability of $M_1$ in the variable length block code. Furthermore,
$E[\tau]$ is roughly $(k + \sqrt{k})$ despite the retransmissions. A decoding error for $M_1$ happens only when $\hat{M}_1 \neq M_1$
and the empirical distribution of the output sequence in the second phase is close to $P_Y$. Note that the latter event
happens with probability $\doteq e^{-C E[\tau]}$.
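The choice of the letter $x_r$ in this strategy can be computed numerically for a given channel. The sketch below is a minimal illustration: the channel matrix `W` (rows indexed by input letters) and the output distribution `p_y` are supplied by the caller, and the function names are hypothetical. The returned divergence is the per-symbol exponent with which the output of $x_r$ looks typical for `p_y`.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in nats; p and q are equal-length lists, with q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def reject_letter(W, p_y):
    """Return (i, D(p_y || W[i])) for the input letter i maximizing the divergence.

    W[i] is the output distribution of input letter i; all entries are assumed positive.
    """
    divergences = [kl_divergence(p_y, row) for row in W]
    best = max(range(len(W)), key=lambda i: divergences[i])
    return best, divergences[best]
```

In the strategy above, `p_y` would be the output distribution $P_Y$; the numbers used in any example call are purely illustrative.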



B. Many special bits 

We now analyze the situation where, instead of a single special bit, there are approximately $E[\tau]\, r / \ln 2$ special
bits out of the total $E[\tau]\, C / \ln 2$ (approx.) bits. Hence we consider the capacity-achieving sequences with feedback
having message sets of the form $\mathcal{M}^{(k)} = \mathcal{M}_1^{(k)} \times \mathcal{M}_2^{(k)}$. Unlike the previous subsection, where the size of $\mathcal{M}_1$
was fixed, we now allow its size to vary with the index of the code. We restrict ourselves to the cases where
$\liminf_{k\to\infty} \frac{\ln |\mathcal{M}_1^{(k)}|}{E[\tau^{(k)}]} = r$. This limit gives us the rate of the special bits. It is worth noting at this point that even when
the rate $r$ of special bits is zero, the number of special bits might not be bounded, i.e. $\liminf_{k\to\infty} |\mathcal{M}_1^{(k)}|$ might be
infinite. The error exponent $E^f_{bits,Q}$ at a given rate $r$ of special bits is defined as follows.

Definition 8: For any capacity-achieving sequence with feedback $Q$ with message sets $\mathcal{M}^{(k)}$ of the form
$\mathcal{M}^{(k)} = \mathcal{M}_1^{(k)} \times \mathcal{M}_2^{(k)}$, $r_Q$ and $E^f_{bits,Q}$ are defined as

$$r_Q \triangleq \liminf_{k\to\infty} \frac{\ln |\mathcal{M}_1^{(k)}|}{E[\tau^{(k)}]}, \qquad E^f_{bits,Q} \triangleq \liminf_{k\to\infty} \frac{-\ln \Pr^{(k)}[\hat{M}_1 \neq M_1]}{E[\tau^{(k)}]}.$$

Then $E^f_{bits}(r)$ is defined as $E^f_{bits}(r) \triangleq \sup_{Q:\, r_Q \geq r} E^f_{bits,Q}$.

The next theorem shows how this exponent decays linearly with the rate $r$ of the special bits.

Theorem 6:

$$E^f_{bits}(r) = \left(1 - \frac{r}{C}\right) C.$$

Notice that the exponent $E^f_{bits}(0) = C$, i.e. it is as high as the exponent in the single bit case, in spite of the
fact that here the number of bits can be growing to infinity with $E[\tau]$. This linear tradeoff between rate and
reliability reminds us of Burnashev's result [9].

Optimal strategy: As in the single bit case, we use a fixed length block code with erasures as our building block.
First the transmitter sends $M_1$ using a capacity-achieving code of length $\frac{r}{C} k$. If the tentative decision $\hat{M}_1$ is correct,
the transmitter sends $M_2$ with a capacity-achieving code of length $(1 - \frac{r}{C}) k$. Otherwise the transmitter sends the channel
input $x_r$ for $(1 - \frac{r}{C}) k$ time units. If the output sequence in the second phase is not typical with $P_Y$, an erasure
is declared and the same strategy is repeated afresh. Otherwise the receiver uses an ML decoder to decide $\hat{M}_2$ and decodes
the message $M$ as $\hat{M} = (\hat{M}_1, \hat{M}_2)$. A decoding error for $M_1$ happens only when an error happens in the first
phase and the output sequence in the second phase is typical with $P_Y$ when the reject codeword is sent. But
the probability of the latter event is $\doteq e^{-(1 - \frac{r}{C}) C k}$. The factor of $(1 - \frac{r}{C})$ arises because of the relative duration of
the second phase within the overall communication block. As in the single bit case, the erasure probability remains
vanishingly small. Thus not only is the expected decoding time of the variable length block code
roughly equal to the block length of the fixed length block code, but its error probabilities are also roughly equal
to the corresponding error probabilities associated with the fixed length block code.
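The linear tradeoff and the two phase lengths can be tabulated with a few lines of code. This is a sketch under an explicit assumption: the prefactor of the tradeoff is passed in separately as `single_bit_exponent`, meaning the exponent obtained at $r = 0$ (the single-bit value of Theorem 5), rather than being computed from the channel. Function and parameter names are illustrative.

```python
def phase_lengths(k, r, capacity):
    """Split a length-k block into the two phases of the strategy:
    (r/C) k symbols for the special bits and (1 - r/C) k symbols for the rest."""
    frac = r / capacity
    return frac * k, (1.0 - frac) * k


def special_bits_exponent(r, capacity, single_bit_exponent):
    """Exponent at rate r of special bits: the r = 0 exponent scaled by the
    fraction (1 - r/C) of the block that is left for the second phase."""
    if not 0.0 <= r <= capacity:
        raise ValueError("r must lie in [0, capacity]")
    return (1.0 - r / capacity) * single_bit_exponent
```

The exponent is largest at $r = 0$ and falls linearly to zero as $r$ approaches the capacity, mirroring the shrinking second phase.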



C. Multiple layers of priority 

We can generalize this result to the case when there are multiple levels of priority, where the most important
layer contains $E[\tau]\, r_1 / \ln 2$ bits, the second-most important layer contains $E[\tau]\, r_2 / \ln 2$ bits, and so on. For
an $L$-layer situation, the message set is of the form $\mathcal{M}^{(k)} = \mathcal{M}_1^{(k)} \times \mathcal{M}_2^{(k)} \times \cdots \times \mathcal{M}_L^{(k)}$. We assume
without loss of generality that the order of importance of the $M_j$'s is $M_1 \succ M_2 \succ \cdots \succ M_L$. Hence we
have $P_e^{M_1} \leq P_e^{M_2} \leq \cdots \leq P_e^{M_L}$.

Then for any $L$-layer capacity-achieving sequence with feedback, we define the error exponent of the $s$-th layer as

$$E^f_{bits,s,Q} \triangleq \liminf_{k\to\infty} \frac{-\ln \Pr^{(k)}[\hat{M}_s \neq M_s]}{E[\tau^{(k)}]}.$$

The achievable error exponent region of the $L$-layered capacity-achieving sequences with feedback is the set
of all achievable exponent vectors $(E^f_{bits,1,Q}, E^f_{bits,2,Q}, \ldots, E^f_{bits,L-1,Q})$. The following theorem determines that
region.
Theorem 7: The achievable error exponent region of the $L$-layered capacity-achieving sequences with feedback,
for rate vector $(r_1, r_2, \ldots, r_{L-1})$, is the set of vectors $(E_1, E_2, \ldots, E_{L-1})$ satisfying

$$E_i \leq \left(1 - \frac{\sum_{j=1}^{i} r_j}{C}\right) C \qquad \forall i \in \{1, 2, \ldots, (L-1)\}.$$

Note that the least important layer cannot achieve any positive error exponent because we are communicating at
capacity, i.e. $E_L = 0$.

Optimal strategy: The transmitter first sends the most important layer, $M_1$, using a capacity-achieving code of length
$\frac{r_1}{C} k$. If it is decoded correctly, then it sends the next layer with a capacity-achieving code of length $\frac{r_2}{C} k$. Otherwise
it starts sending the input letter $x_r$, not only for $\frac{r_2}{C} k$ time units but also for all remaining $L - 2$ phases. The same
strategy is repeated for $M_3, M_4, \ldots, M_L$.

Once the whole block of channel outputs, $Y^k$, is observed, the receiver checks the empirical distribution of the
output in all of the phases except the first one. If they are all typical with $P_Y$, the receiver uses the tentative decisions
to decode, $\hat{M} = (\hat{M}_1, \hat{M}_2, \ldots, \hat{M}_L)$. If one or more of the output sequences are not typical with $P_Y$, an erasure is
declared for the whole block and transmission starts from scratch.



For each layer $i$, with the above strategy we can achieve an exponent as if there were only two kinds of bits
(as in Theorem 6):

• bits in layer $i$ or in more important layers $k < i$ (i.e. special bits)

• bits in less important layers (i.e. ordinary bits).

Hence Theorem 7 not only specifies the optimal performance when there are multiple layers, but also shows
that the performance we observed in Theorem 6 is successively refinable. Figure 4 shows these simultaneously
achievable exponents of Theorem 6 for a particular rate vector $(r_1, \ldots, r_{L-1})$.
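The layer-by-layer bounds can be evaluated for a given rate vector as follows. As a labeled assumption, the prefactor is again passed separately as `single_bit_exponent` (the exponent available at zero rate), and the cumulative rate of layer $i$ and all more important layers plays the role of the rate of special bits; `layer_exponents` is a hypothetical helper name.

```python
def layer_exponents(rates, capacity, single_bit_exponent):
    """Exponent bound for each of the first L-1 layers.

    Layer i is treated as in the two-level case, with all bits in layers 1..i
    counted as special bits of total rate r_1 + ... + r_i.
    """
    exponents = []
    cumulative = 0.0
    for r in rates:
        cumulative += r
        if cumulative > capacity:
            raise ValueError("the sum of the layer rates cannot exceed the capacity")
        exponents.append((1.0 - cumulative / capacity) * single_bit_exponent)
    return exponents
```

The returned list is non-increasing: each additional layer can only lower the fraction of the block that protects it.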

Fig. 4. Successive refinability for multiple layers of priority, demonstrated on an example with six layers; $\sum_{i=1}^{6} r_i = C$. (Plot of exponent versus rate, with the rate axis marked at $r_1, \ldots, r_5$.)



Note that the most important layer can achieve an exponent close to C if its rate is close to zero. As we move 
to the layers with decreasing importance, the achievable error exponent decays gradually. 



D. Special message 

Now consider one particular message, say the first one, which requires small missed-detection probability.
Similar to the no-feedback case, define $E^f_{md}$ as its missed-detection exponent at capacity.

Definition 9: For any capacity-achieving sequence with feedback, $Q$, the missed detection exponent is defined as

$$E^f_{md,Q} \triangleq \liminf_{k\to\infty} \frac{-\ln \Pr^{(k)}[\hat{M} \neq 1 \mid M = 1]}{E[\tau^{(k)}]}.$$

Then $E^f_{md}$ is defined as $E^f_{md} \triangleq \sup_Q E^f_{md,Q}$.
Theorem 8:

$$E^f_{md} = C.$$

Theorems 2 and 8 imply the following corollary.

Corollary 1: Feedback does not improve the missed detection exponent of a single special message: $E^f_{md} = E_{md}$.
If the red-alert exponent were defined as the best protection of a special message achievable at capacity, then this result
could be thought of as an analog of "feedback does not increase capacity" for the red-alert exponent.
Also note that with feedback, $E^f_{md}$ for the special message and $E^f_b$ for the special bit are equal.

E. Many special messages 

Now let us consider the problem where the first $\lceil e^{E[\tau] r} \rceil$ messages are special, i.e. $\mathcal{M}_s = \{1, 2, \ldots, \lceil e^{E[\tau] r} \rceil\}$.
Unlike previous problems, now we will also impose a uniform expected delay constraint as follows.

Definition 10: For any reliable variable length block code with feedback,

$$\Gamma \triangleq \frac{\max_{i \in \mathcal{M}} E[\tau \mid M = i]}{E[\tau]}.$$

A reliable sequence with feedback, $Q$, is a uniform delay reliable sequence with feedback if and only if

$$\lim_{k\to\infty} \Gamma^{(k)} = 1.$$

This means that the average $E[\tau \mid M = i]$ for every message $i$ is essentially equal to $E[\tau]$ (if not smaller). This
uniformity constraint reflects a system requirement for ensuring a robust delay performance, which is invariant of
the transmitted message (footnote 8). Let us define the missed-detection exponent $E^f_{md}(r)$ under this uniform delay constraint.

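Definition 11: For any uniform delay capacity-achieving sequence with feedback, $Q$, the missed detection
exponent achieved on a sequence of subsets $\mathcal{M}_s$ is defined as

$$E^f_{md,Q,\mathcal{M}_s} \triangleq \liminf_{k\to\infty} \frac{-\ln \max_{i \in \mathcal{M}_s^{(k)}} \Pr^{(k)}[\hat{M} \neq i \mid M = i]}{E[\tau^{(k)}]}.$$

Then for a given $r < C$, we define $E^f_{md}(r) \triangleq \sup_{Q, \mathcal{M}_s} E^f_{md,Q,\mathcal{M}_s}$, where the maximization is over $\mathcal{M}_s$'s such that
$\liminf_{k\to\infty} \frac{\ln |\mathcal{M}_s^{(k)}|}{E[\tau^{(k)}]} = r$.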

The following theorem shows that the special messages can achieve the minimum of the red-alert exponent
and Burnashev's exponent at rate $r$.

Theorem 9:

$$E^f_{md}(r) = \min\left\{C, \left(1 - \frac{r}{C}\right) D_{max}\right\} \qquad \forall r < C,$$

where $D_{max} = \max_{i,j \in \mathcal{X}} D(W_{Y|X}(\cdot|i) \| W_{Y|X}(\cdot|j))$.

For $r \in [0, (1 - \frac{C}{D_{max}}) C]$, each special message achieves the best missed detection exponent $C$ for a single special
message, as if the rest of the special messages were absent. For $r \in [(1 - \frac{C}{D_{max}}) C, C)$, the special messages achieve
Burnashev's exponent as if the ordinary messages were absent.
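The two regimes of Theorem 9 can be explored numerically. The sketch below keeps the three quantities as separate inputs, named after their roles in the text: `red_alert` (the single-message missed detection exponent), `capacity`, and `dmax`. The helper `d_max` computes $D_{max}$ from a channel matrix with positive entries; all names are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in nats for equal-length lists with q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def d_max(W):
    """Largest divergence between the output distributions of two input letters."""
    n = len(W)
    return max(kl_divergence(W[i], W[j]) for i in range(n) for j in range(n))


def missed_detection_exponent(r, capacity, red_alert, dmax):
    """Minimum of the red-alert exponent and Burnashev's exponent at rate r."""
    return min(red_alert, (1.0 - r / capacity) * dmax)


def crossover_rate(capacity, red_alert, dmax):
    """Rate at which the two terms of the minimum coincide."""
    return (1.0 - red_alert / dmax) * capacity
```

Below `crossover_rate` the exponent is flat at the red-alert value; above it the exponent decreases linearly and reaches zero at the capacity.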

The optimal strategy is based on transmitting a special bit first. This result demonstrates, yet another time, 
how feedback connects bit-wise UEP with message-wise UEP. In the optimal strategy for bit-wise UEP with 
many bits a special message was used, whereas now in message wise UEP with many messages a special bit is 
used. The roles of bits and messages, in two optimal strategies are simply swapped between the two cases. 

Optimal strategy: We combine the strategy for achieving $C$ for a special bit and the Yamamoto-Itoh strategy
for achieving Burnashev's exponent [40]. In the first phase, a special bit, $b$, is sent with a repetition code of
$\sqrt{k}$ symbols. This is the indicator bit for special messages: it is 1 when a special message is to be sent and
0 otherwise.

(Footnote 8) Optimal exponents in all previous problems remain unchanged irrespective of this uniform delay constraint.

If $b$ is decoded incorrectly as $\hat{b} = 0$, the input letter $x_r$ is sent for the remaining $k$ time units. If it is decoded
correctly as $\hat{b} = 0$, then the ordinary message is sent using a codeword from a capacity-achieving code. If the
output sequence in the second phase is typical with $P_Y$, the receiver uses an ML decoder to choose one of the ordinary
messages; otherwise an erasure is declared for the $(k + \sqrt{k})$ long block.

If $\hat{b} = 1$, then a length $k$ two-phase code with errors-and-erasures decoding, like the one given in [40] by
Yamamoto and Itoh, is used to send the message. In the communication phase a length $\frac{r}{C} k$ capacity-achieving
code is used to send the message $M$, if $M \in \mathcal{M}_s$. If $M \notin \mathcal{M}_s$, an arbitrary codeword from the length $\frac{r}{C} k$
capacity-achieving code is sent. In the control phase, if $M \in \mathcal{M}_s$ and if it is decoded correctly at the end of the
communication phase, the accept letter $x_a$ is sent for $(1 - \frac{r}{C}) k$ time units; otherwise the reject letter $x_d$ is sent for
$(1 - \frac{r}{C}) k$ time units. If the empirical distribution in the control phase is typical with $W_{Y|X}(\cdot|x_a)$, then the special
message decoded at the end of the communication phase becomes the final $\hat{M}$; otherwise an erasure is declared for
the $(k + \sqrt{k})$ long block.

Whenever an erasure is declared for the whole block, the transmitter and receiver apply the above strategy again
from scratch. This scheme is repeated until a non-erasure decoding is reached.



V. Avoiding False Alarms 

In the previous sections while investigating message wise UEP we have only considered the missed detection 
formulation of the problems. In this section we will focus on an alternative formulation of message wise 
UEP problems based on false alarm probabilities. 



A. Block Codes without Feedback 

We first consider the no-feedback case. When a false alarm of a special message is a critical event, e.g. the
"reboot" instruction, the false alarm probability $\Pr[\hat{M} = 1 \mid M \neq 1]$ for this message should be minimized, rather
than the missed detection probability $\Pr[\hat{M} \neq 1 \mid M = 1]$.

Using Bayes' rule and assuming uniformly chosen messages we get

$$\Pr[\hat{M} = 1 \mid M \neq 1] = \frac{\Pr[\hat{M} = 1, M \neq 1]}{\Pr[M \neq 1]} = \frac{\sum_{j \neq 1} \Pr[\hat{M} = 1 \mid M = j]}{|\mathcal{M}| - 1}.$$
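The difference between the two error notions is easy to see on a small example. The sketch below assumes a decoder described by a table `cond`, where `cond[j][m]` is the probability of decoding message `m` when message `j` was sent, and uniformly chosen messages; the table layout and function names are hypothetical.

```python
def false_alarm_probability(cond, special=0):
    """Pr[decoded = special | sent != special] under uniformly chosen messages:
    the average of cond[j][special] over the |M| - 1 other messages j."""
    others = [j for j in range(len(cond)) if j != special]
    return sum(cond[j][special] for j in others) / len(others)


def missed_detection_probability(cond, special=0):
    """Pr[decoded != special | sent = special]."""
    return 1.0 - cond[special][special]
```

A decoder can trade one quantity against the other by shrinking or enlarging the decoding region of the special message.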

In classical error exponent analysis, [20], the error probability for a given message usually means its missed 
detection probability. However, examples such as the "reboot" message necessitate this notion of false alarm 
probability. 

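Definition 12: For a capacity-achieving sequence, $Q$, such that

$$\limsup_{n\to\infty} \Pr^{(n)}[\hat{M} \neq 1 \mid M = 1] = 0,$$

the false alarm exponent is defined as

$$E_{fa,Q} \triangleq \liminf_{n\to\infty} \frac{-\ln \Pr^{(n)}[\hat{M} = 1 \mid M \neq 1]}{n}.$$

Then $E_{fa}$ is defined as $E_{fa} \triangleq \sup_Q E_{fa,Q}$.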
Thus $E_{fa}$ is the best exponential decay rate of the false alarm probability with $n$. Unfortunately we do not have
the exact expression for $E_{fa}$. However, the upper bound given below is sufficient to demonstrate the improvement
introduced by feedback and variable decoding time.

Theorem 10:

$$E^l_{fa} \leq E_{fa} \leq E^u_{fa}.$$

The upper and lower bounds to the false alarm exponent are given by

$$E^l_{fa} \triangleq \max_{i \in \mathcal{X}} \ \min_{V_{Y|X}:\ \sum_j V_{Y|X}(\cdot|j) P_X^*(j) = W_{Y|X}(\cdot|i)} D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) \mid P_X^*),$$

$$E^u_{fa} \triangleq \max_{i \in \mathcal{X}} D(W_{Y|X}(\cdot|i) \| W_{Y|X}(\cdot|X) \mid P_X^*).$$

The maximizers of the optimizations for $E^l_{fa}$ and $E^u_{fa}$ are denoted by $x_{f_l}$ and $x_{f_u}$:

$$E^l_{fa} = \min_{V_{Y|X}:\ \sum_j V_{Y|X}(\cdot|j) P_X^*(j) = W_{Y|X}(\cdot|x_{f_l})} D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) \mid P_X^*),$$

$$E^u_{fa} = D(W_{Y|X}(\cdot|x_{f_u}) \| W_{Y|X}(\cdot|X) \mid P_X^*).$$

Strategy to reach the lower bound: The codeword for the special message $M = 1$ is a repetition sequence of the input letter
$x_{f_l}$. Its decoding region is the typical 'noise ball' around it, i.e. the output sequences whose empirical distribution is
approximately equal to $W_{Y|X}(\cdot|x_{f_l})$. For the ordinary messages, we use a capacity-achieving codebook where
all codewords have the same empirical distribution (approx.) $P_X^*$. Then for $y^n$ whose empirical distribution is
not in the typical 'noise ball' around the special codeword, the receiver performs ML decoding among the ordinary
codewords.

Note the contrast between this strategy for achieving $E^l_{fa}$ and the optimal strategy for achieving $E_{md}$. For achieving
$E_{md}$, output sequences of any type other than the ones close to $P_Y$ were decoded as the special message; whereas
for achieving $E^l_{fa}$, only the output sequences of types that are close to $W_{Y|X}(\cdot|x_{f_l})$ were decoded as the special
message.
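The decoding rule for the special message can be sketched as a test on the empirical distribution (type) of the received sequence. This is an illustration under stated assumptions: output symbols are integers `0, ..., |Y|-1`, closeness of types is measured coordinate-wise with a hypothetical tolerance `eps`, and `ordinary_decoder` stands in for the ML decoder among ordinary codewords.

```python
from collections import Counter


def empirical_distribution(y, alphabet_size):
    """Type of the sequence y over the alphabet {0, ..., alphabet_size - 1}."""
    counts = Counter(y)
    n = len(y)
    return [counts.get(a, 0) / n for a in range(alphabet_size)]


def is_typical(y, q, eps):
    """True if the type of y is within eps of q in every coordinate."""
    p = empirical_distribution(y, len(q))
    return all(abs(pa - qa) <= eps for pa, qa in zip(p, q))


def decode_special_or_ordinary(y, w_special, eps, ordinary_decoder):
    """Decode the special message (labelled 1) only if the type of y is close to
    w_special, the output distribution of the repeated special letter;
    otherwise defer to the decoder for ordinary messages."""
    if is_typical(y, w_special, eps):
        return 1
    return ordinary_decoder(y)
```

Keeping the accepted set of types this small is what makes a false alarm unlikely when an ordinary codeword is sent.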




Fig. 5. Avoiding false-alarm 



Intuitive interpretation: A good false alarm exponent for the special message corresponds to having the smallest
possible decoding region for the special message. This ensures that when some ordinary message is transmitted,
the probability of the event $\{\hat{M} = 1\}$ is exponentially small. We cannot make the region too small though, because when the
special message is transmitted, the probability of the very same event should be almost one. Hence the decoding
region of the special message should at least contain the typical noise ball around the special codeword. The
blue region in Fig. 5 denotes such a region.
Note that $E^l_{fa}$ is at least as large as the channel capacity $C$, due to the convexity of KL divergence:

$$\begin{aligned}
E^l_{fa} &= \max_{i \in \mathcal{X}} \ \min_{V_{Y|X}:\ \sum_j V_{Y|X}(\cdot|j) P_X^*(j) = W_{Y|X}(\cdot|i)} D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) \mid P_X^*) \\
&\geq \max_{i \in \mathcal{X}} \ \min_{V_{Y|X}:\ \sum_j V_{Y|X}(\cdot|j) P_X^*(j) = W_{Y|X}(\cdot|i)} D\Big(\sum_k P_X^*(k) V_{Y|X}(\cdot|k) \,\Big\|\, \sum_{k'} P_X^*(k') W_{Y|X}(\cdot|k')\Big) \\
&= \max_{i \in \mathcal{X}} D(W_{Y|X}(\cdot|i) \| P_Y(\cdot)) \\
&= C
\end{aligned}$$

where $P_Y$ denotes the output distribution corresponding to the capacity-achieving input distribution $P_X^*$ and the
last equality follows from the KKT condition for achieving capacity mentioned previously [20, Theorem 4.5.1].
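The last equality can be checked numerically on a simple channel. The sketch below uses a binary symmetric channel with crossover probability 0.1 and the uniform input distribution, which is capacity-achieving for that channel; both the channel and the helper names are choices made here for illustration only.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in nats for equal-length lists with q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def output_distribution(p_x, W):
    """P_Y(k) = sum_i P_X(i) W(k|i), with W[i] the output law of input letter i."""
    return [sum(p_x[i] * W[i][k] for i in range(len(W))) for k in range(len(W[0]))]


def kkt_divergences(p_x, W):
    """D(W(.|i) || P_Y) for every input letter i."""
    p_y = output_distribution(p_x, W)
    return [kl_divergence(row, p_y) for row in W]


def bsc_capacity(p):
    """Capacity of a binary symmetric channel in nats: ln 2 - h(p)."""
    return math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
```

For the channel `[[0.9, 0.1], [0.1, 0.9]]` with input distribution `[0.5, 0.5]`, every entry of `kkt_divergences` coincides with `bsc_capacity(0.1)`, as the KKT condition requires for letters used by the capacity-achieving input distribution.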

Now we can compare our result for a special message with the similar result for the classical situation where all
messages are treated equally. It turns out that if every message in a capacity-achieving code demands an equally
good false-alarm exponent, then this uniform exponent cannot be larger than $C$. This result seems to be directly
connected with the problem of identification via channels [1]. We can prove the achievability part of their capacity
theorem using an extension of the achievability part of $E^l_{fa}$. Perhaps a new converse of their result is also possible
using such results. Furthermore we see that reducing the demand of false-alarm exponent to only one message,
instead of all, enhances it from $C$ to at least $E^l_{fa}$.



B. Variable Length Block Codes with Feedback 

Recall that feedback does not improve the missed-detection exponent for a special message. On the contrary,
the false-alarm exponent of a special message is improved when feedback is available and variable decoding
time is allowed. We again restrict ourselves to uniform delay capacity-achieving sequences with feedback, i.e. capacity-achieving
sequences satisfying $\lim_{k\to\infty} \Gamma^{(k)} = 1$.

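Definition 13: For a uniform delay capacity-achieving sequence with feedback, $Q$, such that

$$\limsup_{k\to\infty} \Pr^{(k)}[\hat{M} \neq 1 \mid M = 1] = 0,$$

the false alarm exponent is defined as

$$E^f_{fa,Q} \triangleq \liminf_{k\to\infty} \frac{-\ln \Pr^{(k)}[\hat{M} = 1 \mid M \neq 1]}{E[\tau^{(k)}]}.$$

Then $E^f_{fa}$ is defined as $E^f_{fa} \triangleq \sup_Q E^f_{fa,Q}$.

Theorem 11:

$$E^f_{fa} = D_{max}.$$

Note that $D_{max} > E^u_{fa}$. Thus feedback strictly improves the false alarm exponent, $E^f_{fa} > E_{fa}$.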

Optimal strategy: We use a strategy similar to the one employed in proving Theorem 9 in Subsection IV-E. In
the first phase, a length $\sqrt{k}$ code is used to convey whether $M = 1$ or not, using a special bit $b = \mathbb{1}_{\{M = 1\}}$.

• If $\hat{b} = 0$, a length $k$ capacity-achieving code with $E_{md} = C$ is used. If the decoded message for the length
$k$ code is 1, an erasure is declared for the $(k + \sqrt{k})$ long block. Otherwise the decoded message of the length $k$ code
becomes the decoded message for the whole $(k + \sqrt{k})$ long block.

• If $\hat{b} = 1$,

- and $M = 1$, the input symbol $x_a$ is transmitted for $k$ time units;

- and $M \neq 1$, the input symbol $x_d$ is transmitted for $k$ time units.

If the output sequence in the second phase, $Y_{\sqrt{k}+1}^{\sqrt{k}+k}$, is typical with $W_{Y|X}(\cdot|x_a)$, then $\hat{M} = 1$; otherwise an erasure is declared for
the $(k + \sqrt{k})$ long block.

The receiver and transmitter start from scratch if an erasure is declared at the end of the second phase.

Note that, this strategy simultaneously achieves the optimal missed-detection exponent C and the optimal 
false-alarm exponent D max for this special message. 

VI. Future directions 

In this paper we have restricted our investigation of UEP problems to data rates that are essentially equal to
the channel capacity. The scenarios we have analyzed provide us with a rich class of problems when we consider
data rates below capacity.

Most of the UEP problems have a coding-theoretic version. In these coding-theoretic versions, deterministic
guarantees in terms of Hamming distances are demanded instead of probabilistic guarantees in terms of
error exponents. As we have mentioned in Section I-A, coding-theoretic versions of bit-wise UEP problems have
been studied extensively for the case of linear codes. But it seems that coding-theoretic versions of both message-wise
UEP problems and bit-wise UEP problems for non-linear codes are scarcely investigated [3], [5].

Throughout this paper, we focused on the channel coding component of communication. However, the final
objective is often to communicate a source within some distortion constraint. The message-wise UEP problem itself
first came up within this framework [12]. But the source we are trying to convey can itself be heterogeneous,
in the sense that some part of its output may demand a smaller distortion than other parts. Understanding optimal
methods for communicating such sources over noisy channels presents many novel joint source-channel coding
problems.

At times the final objective of communication is achieving some coordination between various agents [14]. In
these scenarios the channel is used for both communicating data and achieving coordination. A new class of problems
arises when we try to figure out the tradeoffs between the error exponents of coordination and data.

We can also actively use UEP in network protocols. For example, a relay can forward some partial information 
even if it cannot decode everything. This partial information could be characterized in terms of special bits as 
well as special messages. Another example is two-way communication, where UEP can be used for more reliable 
feedback and synchronization. 

Information theoretic understanding of UEP also gives rise to some network optimization problems. With UEP, 
the interface to physical layer is no longer bits. Instead, it is a collection of various levels of error protection. 
The achievable channel resources of reliability and rate need to be efficiently divided amongst these levels, which 
gives rise to many resource allocation problems. 

VII. Block Codes without Feedback: Proofs 

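In the following sections, we use the following standard notation for entropy, conditional entropy and mutual
information:

$$H(P_X) = \sum_{i \in \mathcal{X}} P_X(i) \ln \frac{1}{P_X(i)},$$

$$H(W_{Y|X} \mid P_X) = \sum_{j \in \mathcal{X}, k \in \mathcal{Y}} P_X(j) W_{Y|X}(k|j) \ln \frac{1}{W_{Y|X}(k|j)},$$

$$I(P_X, W_{Y|X}) = \sum_{j \in \mathcal{X}, k \in \mathcal{Y}} P_X(j) W_{Y|X}(k|j) \ln \frac{W_{Y|X}(k|j)}{\sum_{i \in \mathcal{X}} P_X(i) W_{Y|X}(k|i)}.$$

In addition we denote the decoding region of a message $i \in \mathcal{M}$ by $\mathcal{G}(i)$, i.e.

$$\mathcal{G}(i) \triangleq \{y^n : \hat{M}(y^n) = i\}.$$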
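The three quantities above translate directly into code. The following sketch uses natural logarithms as in the definitions, represents the channel as a list of rows `W[j]` (the output distribution of input letter `j`), and uses function names chosen here for illustration.

```python
import math


def entropy(p):
    """H(P) = sum_i P(i) ln(1 / P(i)), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def conditional_entropy(W, p_x):
    """H(W | P_X) = sum_{j,k} P_X(j) W(k|j) ln(1 / W(k|j))."""
    return sum(p_x[j] * entropy(W[j]) for j in range(len(W)))


def mutual_information(p_x, W):
    """I(P_X, W) = sum_{j,k} P_X(j) W(k|j) ln( W(k|j) / sum_i P_X(i) W(k|i) )."""
    p_y = [sum(p_x[i] * W[i][k] for i in range(len(W))) for k in range(len(W[0]))]
    total = 0.0
    for j, row in enumerate(W):
        for k, w in enumerate(row):
            if p_x[j] > 0 and w > 0:
                total += p_x[j] * w * math.log(w / p_y[k])
    return total
```

A quick consistency check is the identity $I(P_X, W_{Y|X}) = H(P_Y) - H(W_{Y|X} \mid P_X)$, where $P_Y$ is the output distribution induced by $P_X$.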
A. Proof of Theorem 1

Proof:

We first show that any capacity-achieving sequence $Q$ with $E_{b,Q}$ can be used to construct another capacity-achieving
sequence $Q'$ with $E_{b,Q'} = \frac{E_{b,Q}}{2}$, all members of which are fixed composition codes. Then we show
that $E_{b,Q'} = 0$ for any capacity-achieving sequence $Q'$ which only includes fixed composition codes.

Consider a capacity-achieving sequence $Q$ with message sets $\mathcal{M}^{(n)} = \mathcal{M}_1 \times \mathcal{M}_2^{(n)}$, where $\mathcal{M}_1 = \{0, 1\}$. As
a result of Markov's inequality, at least $\frac{4}{5} |\mathcal{M}^{(n)}|$ of the messages $i \in \mathcal{M}^{(n)}$ satisfy

$$\Pr[\hat{M}_1 \neq M_1 \mid M = i] \leq 5 \Pr[\hat{M}_1 \neq M_1]. \qquad (1)$$

Similarly, at least $\frac{4}{5} |\mathcal{M}^{(n)}|$ of the messages $i \in \mathcal{M}^{(n)}$ satisfy

$$\Pr[\hat{M} \neq M \mid M = i] \leq 5 \Pr[\hat{M} \neq M]. \qquad (2)$$

Thus at least $\frac{3}{5} |\mathcal{M}^{(n)}|$ of the messages in $\mathcal{M}^{(n)}$ satisfy both (1) and (2). Consequently at least $\frac{1}{10} |\mathcal{M}^{(n)}|$
messages are of the form $(0, M_2)$ and satisfy equations (1) and (2). If we group them according to their empirical
distribution, at least one of the groups will have more than $\frac{|\mathcal{M}^{(n)}|}{10 (n+1)^{|\mathcal{X}|}}$ messages, because the number of different
empirical distributions for elements of $\mathcal{X}^n$ is less than $(n+1)^{|\mathcal{X}|}$. We keep the first $\frac{|\mathcal{M}^{(n)}|}{10 (n+1)^{|\mathcal{X}|}}$ codewords of this
most populous type, denote them by $x'_A(\cdot)$, and throw away all other codewords corresponding to the messages
of the form $(0, M_2)$. We do the same for the messages of the form $M = (1, M_2)$ and denote the corresponding
codewords by $x'_B(\cdot)$.
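The counting step in this argument relies on the fact that the number of empirical distributions grows only polynomially with the block length. The sketch below computes the exact number of types with the standard stars-and-bars formula and the resulting pigeonhole bound; the function names are illustrative.

```python
from math import comb


def number_of_types(n, alphabet_size):
    """Exact number of empirical distributions of length-n sequences over an
    alphabet of the given size: C(n + |X| - 1, |X| - 1)."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)


def largest_class_lower_bound(num_codewords, n, alphabet_size):
    """Pigeonhole bound: some type class contains at least
    num_codewords / (n + 1)^|X| of the codewords."""
    return num_codewords / (n + 1) ** alphabet_size
```

Since the number of codewords grows exponentially in $n$ while $(n+1)^{|\mathcal{X}|}$ is polynomial, restricting to the most populous type does not change the rate of the code.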

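Thus we have a length $n$ code with message set $\mathcal{M}'$ of the form $\mathcal{M}' = \mathcal{M}_1 \times \mathcal{M}'_2$, where $\mathcal{M}_1 = \{0, 1\}$ and
$|\mathcal{M}'_2| = \frac{|\mathcal{M}^{(n)}|}{10 (n+1)^{|\mathcal{X}|}}$. Furthermore,

$$\Pr[\hat{M}'_1 \neq M'_1 \mid M' = i] \leq 5 \Pr[\hat{M}_1 \neq M_1], \qquad \Pr[\hat{M}' \neq M' \mid M' = i] \leq 5 \Pr[\hat{M} \neq M] \qquad \forall i \in \mathcal{M}'.$$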

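Now let us consider the following block code of length $2n$ with message set $\mathcal{M}'' = \mathcal{M}_1 \times \mathcal{M}''_2 \times \mathcal{M}''_3$ where $\mathcal{M}''_2 =
\mathcal{M}''_3 = \mathcal{M}'_2$. If $M'' = (0, M''_2, M''_3)$ then $x(M'') = x'_A(M''_2)\, x'_B(M''_3)$. If $M'' = (1, M''_2, M''_3)$ then $x(M'') =
x'_B(M''_2)\, x'_A(M''_3)$. The decoder of this new length $2n$ code uses the decoder of the original length $n$ code first on $y^n$
and then on $y_{n+1}^{2n}$. If the concatenation of the length $n$ codewords corresponding to the decoded halves is a codeword
for an $i \in \mathcal{M}''$ then $\hat{M}'' = i$. Else an arbitrary message is decoded. One can easily see that the error probability
of the length $2n$ code is less than twice the error probability of the length $n$ code, i.e.

$$\Pr[\hat{M}'' \neq M'' \mid M'' = (k, i, j)] \le 1 - \left(1 - \Pr[\hat{M}' \neq M' \mid M' = (k, i)]\right)\left(1 - \Pr[\hat{M}' \neq M' \mid M' = (1-k, j)]\right) \le 2 \max_{i' \in \mathcal{M}'} \Pr[\hat{M}' \neq M' \mid M' = i'].$$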

Furthermore the bit error probability of the new code is also at most twice the bit error probability of the length $n$
code, i.e.

$$\Pr[\hat{M}''_1 \neq M''_1 \mid M'' = (k, i, j)] \le 1 - \left(1 - \Pr[\hat{M}'_1 \neq M'_1 \mid M' = (k, i)]\right)\left(1 - \Pr[\hat{M}'_1 \neq M'_1 \mid M' = (1-k, j)]\right) \le 2 \max_{i' \in \mathcal{M}'} \Pr[\hat{M}'_1 \neq M'_1 \mid M' = i'].$$

Thus using these codes one can obtain a capacity achieving sequence $Q'$ with $E_{b,Q'} = E_{b,Q}/2$, all members of
which are fixed composition codes.

In the following discussion we focus on capacity achieving sequences $Q$ which are composed of fixed
composition codes only. We will show that $E_{b,Q} = 0$ for all capacity achieving $Q$'s with fixed composition
codes. Consequently the discussion above implies that $E_b = 0$.
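The reason the length $2n$ construction yields a fixed composition code is that every codeword is one type-A half followed or preceded by one type-B half, so the overall symbol counts never depend on the message. A small sketch of this, with hypothetical toy codebooks `code_a` and `code_b` standing in for $x'_A(\cdot)$ and $x'_B(\cdot)$:

```python
from collections import Counter

def composition(seq):
    """Sorted symbol counts of a sequence, used as its composition."""
    return tuple(sorted(Counter(seq).items()))

def concatenated_codeword(bit, m2, m3, code_a, code_b):
    """x(M'') for M'' = (bit, m2, m3): A-half then B-half if bit == 0, else B then A."""
    if bit == 0:
        return code_a[m2] + code_b[m3]
    return code_b[m2] + code_a[m3]

# Hypothetical toy codebooks: each list has a single composition.
code_a = ["0011", "0101", "1001"]   # two 0s and two 1s
code_b = ["0111", "1011", "1101"]   # one 0 and three 1s

codebook = {
    (b, i, j): concatenated_codeword(b, i, j, code_a, code_b)
    for b in (0, 1) for i in range(3) for j in range(3)
}
compositions = {composition(cw) for cw in codebook.values()}
```

All 18 codewords are distinct and share one composition, even though the two halves individually have different types.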
We call the empirical distribution of a given output sequence $y^n$, conditioned on the codeword $x(i)$, the
conditional type of $y^n$ given the message $i$ and denote it by $V(y^n, i)$. Furthermore we call the set of $y^n$'s whose
conditional type with message $i$ is $V$ the $V$-shell of $i$ and denote it by $T_V(i)$. Similarly we denote the set of
output sequences $y^n$ with the empirical distribution $U_Y$ by $T_{U_Y}$.

We denote the empirical distribution of the codewords of the $n$-th code of the sequence by $P_X^{(n)}$ and the
corresponding output distribution by $P_Y^{(n)}$, i.e.

$$P_Y^{(n)}(\cdot) = \sum_{j \in \mathcal{X}} W_{Y|X}(\cdot|j)\, P_X^{(n)}(j).$$

We simply use $P_X$ and $P_Y$ whenever the value of $n$ is unambiguous from the context. Furthermore $P_Y^n(\cdot)$ stands
for the probability measure on $\mathcal{Y}^n$ such that

$$P_Y^n(y^n) = \prod_{k=1}^{n} P_Y(y_k).$$

$S_{0,V}^{(n)}$ is the set of $y^n$'s for which $\hat{M}_1 = 0$ and $V(y^n, \hat{M}(y^n)) = V$:

$$S_{0,V}^{(n)} = \{y^n : V(y^n, \hat{M}(y^n)) = V \text{ and } \hat{M}(y^n) = (0, j) \text{ for some } j \in \mathcal{M}_2\} \quad (3)$$

In other words, $S_{0,V}^{(n)}$ is the set of $y^n$'s such that $y^n \in T_V(\hat{M}(y^n))$ and the decoded value of the first bit is zero.

Note that since for each $y^n \in \mathcal{Y}^n$ there is a unique $\hat{M}(y^n)$, and for each $y^n \in \mathcal{Y}^n$ and message $i \in \mathcal{M}$ there is a
unique $V(y^n, i)$, each $y^n$ belongs to a unique $S_{0,V}^{(n)}$ or $S_{1,V}^{(n)}$, i.e. the $S_{0,V}^{(n)}$'s and $S_{1,V}^{(n)}$'s are disjoint sets that collectively
cover the set $\mathcal{Y}^n$.
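To make the conditional type and the $V$-shell concrete, the sketch below computes the conditional type of an output sequence given a codeword and checks shell membership. The helper names and the short binary sequences are illustrative assumptions.

```python
from collections import Counter

def conditional_type(x, y):
    """Empirical conditional distribution of y given x.

    Maps (a, b) to the fraction of positions where x has symbol a and y has
    symbol b, among all positions where x has symbol a. Pairs that never
    occur are omitted.
    """
    joint = Counter(zip(x, y))
    x_counts = Counter(x)
    return {(a, b): c / x_counts[a] for (a, b), c in joint.items()}

def in_v_shell(x, y, v):
    """True when y lies in the V-shell of the codeword x."""
    return conditional_type(x, y) == v

# Illustrative example (an assumption): one codeword and three outputs.
x = "0011"
v = conditional_type(x, "0111")
same_shell = in_v_shell(x, "1011", v)   # one flip in the first half, as before
other_shell = in_v_shell(x, "0011", v)  # noiseless output, different shell
```

Two outputs share a shell exactly when they have the same number of each (input, output) pair, regardless of where in the block those pairs occur.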

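Let us define the typical neighborhood of $W_{Y|X}$ as $[W]$:

$$[W] \triangleq \{V_{Y|X} : |V_{Y|X}(j|i) P_X(i) - W_{Y|X}(j|i) P_X(i)| \le n^{-1/4} \ \ \forall i, j\} \quad (4)$$

Let us denote the union of all $S_{0,V}^{(n)}$'s for typical $V$'s by $S_0^{(n)} = \bigcup_{V \in [W]} S_{0,V}^{(n)}$. We will establish the following
inequality later. Let us assume for the moment that it holds.

$$P_Y^n(S_0^{(n)}) \ge e^{n(R^{(n)} - (C + \epsilon_n))} \left(\frac{1}{2} - \frac{|\mathcal{X}||\mathcal{Y}|}{8\sqrt{n}} - P_e^{(n)}\right) \quad (5)$$

where $\lim_{n \to \infty} \epsilon_n = 0$.

As a result of the bound given in (5) and the blowing up lemma [13, Ch. 1, Lemma 5.4], we can conclude that
for any capacity achieving sequence $Q$, there exists a sequence of $(\ell_n, \eta_n)$ pairs satisfying $\lim_{n \to \infty} \eta_n = 1$ and
$\lim_{n \to \infty} \frac{\ell_n}{n} = 0$ such that

$$P_Y^n(\Gamma^{\ell_n}(S_0^{(n)})) \ge \eta_n$$

where $\Gamma^{\ell_n}(A)$ is the set of all $y^n$'s which differ from an element of $A$ in at most $\ell_n$ places. Clearly one can
repeat the same argument for $S_1^{(n)}$ to get

$$P_Y^n(\Gamma^{\ell_n}(S_1^{(n)})) \ge \eta_n.$$

Consequently,

$$P_Y^n\big(\Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})\big) = P_Y^n\big(\Gamma^{\ell_n}(S_0^{(n)})\big) + P_Y^n\big(\Gamma^{\ell_n}(S_1^{(n)})\big) - P_Y^n\big(\Gamma^{\ell_n}(S_0^{(n)}) \cup \Gamma^{\ell_n}(S_1^{(n)})\big)$$

$$P_Y^n\big(\Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})\big) \ge 2\eta_n - 1.$$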
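The blow-up operation used here is a Hamming neighborhood. The following brute-force sketch, feasible only for tiny $n$, enumerates such a neighborhood and the intersection of two blown-up sets; the function names and the two single-point sets are illustrative assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(u != v for u, v in zip(a, b))

def blow_up(subset, ell, alphabet, n):
    """All length-n sequences within Hamming distance ell of some element of subset."""
    return {y for y in product(alphabet, repeat=n)
            if any(hamming(y, a) <= ell for a in subset)}

# Illustrative example (an assumption): two far-apart single-point sets.
n = 4
s0 = {(0, 0, 0, 0)}
s1 = {(1, 1, 1, 1)}
g0 = blow_up(s0, 2, (0, 1), n)
g1 = blow_up(s1, 2, (0, 1), n)
common = g0 & g1
```

Even though `s0` and `s1` are disjoint, their blow-ups overlap on the weight-two sequences. The proof uses the same effect at scale: outputs in the overlap are close to sequences decoded to either value of the first bit.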
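Note that if $y^n \in \Gamma^{\ell_n}(S_1^{(n)})$, then there exists at least one element $\tilde{y}^n \in T_{P_Y}$ which differs from $y^n$ in at most
$(|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n)$ places (see footnote 9). Thus we can upper bound its probability by

$$y^n \in \Gamma^{\ell_n}(S_1^{(n)}) \ \Rightarrow \ P_Y^n(y^n) \le e^{-nH(P_Y) - (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}$$

where $\lambda = \min_{i,j} W_{Y|X}(j|i)$. Thus we have

$$\left|\Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})\right| \ge (2\eta_n - 1)\, e^{nH(P_Y) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda} \quad (6)$$

Note that for any $y^n \in \Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})$, there exists a $\tilde{y}^n \in T_W(i)$ for an $i$ of the form $i = (0, M_2)$ which
differs from $y^n$ in at most $(|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n)$ places (see footnote 10). Consequently

$$\Pr[y^n \mid M = i] \ge e^{-nH(W_{Y|X}|P_X) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda} \quad (7)$$

Since $|\mathcal{M}_2| = \frac{e^{nR^{(n)}}}{2}$, using equation (7) we can lower bound the probability of $y^n$ under the hypothesis $M_1 = 0$
as follows,

$$\Pr[y^n \mid M_1 = 0] = \sum_{j \in \mathcal{M}_2} \Pr[y^n \mid M = (0, j)] \Pr[M = (0, j) \mid M_1 = 0] \ge 2 e^{-n(H(W_{Y|X}|P_X) + R^{(n)}) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda} \quad (8)$$

Clearly the same holds for $M_1 = 1$ too, thus

$$\Pr[y^n \mid M_1 = 1] \ge 2 e^{-n(H(W_{Y|X}|P_X) + R^{(n)}) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}. \quad (9)$$

Consequently,

$$\Pr[\hat{M}_1 \neq M_1] \ge \sum_{y^n \in \Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})} \frac{1}{2} \min\left(\Pr[y^n \mid M_1 = 0], \Pr[y^n \mid M_1 = 1]\right)$$

$$\overset{(a)}{\ge} \sum_{y^n \in \Gamma^{\ell_n}(S_0^{(n)}) \cap \Gamma^{\ell_n}(S_1^{(n)})} e^{-n(H(W_{Y|X}|P_X) + R^{(n)}) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}$$

$$\overset{(b)}{\ge} (2\eta_n - 1)\, e^{nH(P_Y) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}\, e^{-n(H(W_{Y|X}|P_X) + R^{(n)}) + (|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}$$

$$= (2\eta_n - 1)\, e^{n(I(P_X, W_{Y|X}) - R^{(n)}) + 2(|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda} \quad (10)$$

where (a) follows from equations (8) and (9) and (b) follows from equation (6).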
Using Fano's inequality we get

$$I(M; Y^n) - nR^{(n)} \ge -\ln 2 - nR^{(n)} P_e^{(n)} \quad (11)$$

where $I(M; Y^n)$ is the mutual information between the message $M$ and channel output $Y^n$. In addition we can
upper bound $I(M; Y^n)$ as follows,

$$I(M; Y^n) = \sum_{i \in \mathcal{M},\, y^n} \Pr[i, y^n] \ln \frac{\Pr[y^n \mid i]}{\Pr[y^n]}$$

$$\overset{(a)}{\le} \sum_{i \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{k=1}^{n} \sum_{y_k} W_{Y|X}(y_k | x_k(i)) \ln \frac{W_{Y|X}(y_k | x_k(i))}{P_Y(y_k)}$$

$$\overset{(b)}{=} n I(P_X, W_{Y|X}) \quad (12)$$

where $P_Y(\cdot) = \sum_{j \in \mathcal{X}} W_{Y|X}(\cdot|j) P_X(j)$. Step (a) follows from the non-negativity of KL divergence and step (b) follows
from the fact that all the codewords are of type $P_X(\cdot)$.

9 Because of the integer constraints $T_{P_Y}$ might actually be an empty set. If so we can make a similar argument for the $U_Y$ which
minimizes $|U_Y(j) - P_Y(j)|$. However this technicality is inconsequential.
10 Integer constraints here are inconsequential too.

Using equations (10), (11) and (12) we get

$$\Pr[\hat{M}_1 \neq M_1] \ge (2\eta_n - 1)\, e^{-\ln 2 - nR^{(n)} P_e^{(n)} + 2(|\mathcal{Y}||\mathcal{X}| n^{3/4} + \ell_n) \ln \lambda}.$$

Thus using $\lim_{n \to \infty} P_e^{(n)} = 0$, $\lim_{n \to \infty} \eta_n = 1$ and $\lim_{n \to \infty} \frac{\ell_n}{n} = 0$ we conclude that
$\lim_{n \to \infty} -\frac{1}{n} \ln \Pr[\hat{M}_1 \neq M_1] = 0$, i.e. $E_{b,Q} = 0$.

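Now the only thing left, for proving $E_b = 0$, is to establish inequality (5). One can write the error probability of the
$n$-th code of $Q$ as

$$P_e^{(n)} = \sum_{i \in \mathcal{M}^{(n)}} \frac{1}{|\mathcal{M}^{(n)}|} \sum_{y^n} \left(1 - \mathbb{1}_{\{\hat{M}(y^n) = i\}}\right) \Pr[y^n \mid M = i]$$

$$= \sum_{i \in \mathcal{M}^{(n)}} e^{-nR^{(n)}} \sum_{V} \sum_{y^n \in T_V(i)} \left(1 - \mathbb{1}_{\{\hat{M}(y^n) = i\}}\right) e^{-n(D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X))}$$

$$= \sum_{V} e^{-n(D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X) + R^{(n)})} \sum_{i \in \mathcal{M}^{(n)}} \sum_{y^n \in T_V(i)} \left(1 - \mathbb{1}_{\{\hat{M}(y^n) = i\}}\right)$$

$$= \sum_{V} e^{-n(D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X) + R^{(n)})} \left(Q_{0,V} + Q_{1,V}\right) \quad (13)$$

where $Q_{k,V} = \sum_{j \in \mathcal{M}_2} \sum_{y^n \in T_V((k,j))} \left(1 - \mathbb{1}_{\{\hat{M}(y^n) = (k,j)\}}\right)$ for $k = 0, 1$.

Note that $Q_{k,V}$ is the sum, over the messages $i$ for which $M_1 = k$, of the number of the elements in $T_V(i)$
that are not decoded to message $i$. In a sense it is a measure of the contribution of the $V$-shells of different
codewords to the error probability. We will use equation (13) to establish lower bounds on the $P_Y^n(S_{0,V}^{(n)})$'s.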



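Note that all elements of $S_{0,V}^{(n)}$ have the same probability under $P_Y^n(\cdot)$ and

$$P_Y^n(S_{0,V}^{(n)}) = |S_{0,V}^{(n)}|\, e^{-n\zeta} \quad \text{where} \quad \zeta = \sum_{x,y} P_X(x) V_{Y|X}(y|x) \ln \frac{1}{P_Y(y)} \quad (14)$$

Note that

$$\zeta = \sum_{x,y} P_X(x) V_{Y|X}(y|x) \ln \frac{W_{Y|X}(y|x)}{P_Y(y)} + \sum_{x,y} P_X(x) V_{Y|X}(y|x) \ln \frac{1}{W_{Y|X}(y|x)}$$

$$= I(P_X, W_{Y|X}) + D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X) + \sum_{x,y} P_X(x) \left(V_{Y|X}(y|x) - W_{Y|X}(y|x)\right) \ln \frac{W_{Y|X}(y|x)}{P_Y(y)}$$

Recall that $I(P_X, W_{Y|X}) \le C$ and $\min_{i,j} W_{Y|X}(i|j) = \lambda$. Thus using the definition of $[W]$ given in equation
(4) we get

$$\zeta \le C + \epsilon_n + D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X) \qquad \forall V_{Y|X} \in [W] \quad (15)$$

where $\epsilon_n = \frac{|\mathcal{X}||\mathcal{Y}|}{n^{1/4}} \ln \frac{1}{\lambda}$.
Note that

$$|S_{0,V}^{(n)}| = |\mathcal{M}_2^{(n)}| \cdot |T_V(i)| - Q_{0,V} = \frac{1}{2} |T_V(i)|\, e^{nR^{(n)}} - Q_{0,V}. \quad (16)$$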
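Recalling that the $S_{0,V}^{(n)}$'s are disjoint and using equations (14), (15) and (16) we get

$$P_Y^n(S_0^{(n)}) = \sum_{V \in [W]} P_Y^n(S_{0,V}^{(n)})$$

$$\ge e^{-n(C + \epsilon_n)} \sum_{V \in [W]} \left(\frac{1}{2} |T_V(i)|\, e^{nR^{(n)}} - Q_{0,V}\right) e^{-n(D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X))}$$

$$\overset{(a)}{\ge} e^{n(R^{(n)} - (C + \epsilon_n))} \left(\frac{1}{2} \sum_{V \in [W]} |T_V(i)|\, e^{-n(D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) + H(V_{Y|X}|P_X))} - P_e\right)$$

$$= e^{n(R^{(n)} - (C + \epsilon_n))} \left(\frac{1}{2} \sum_{V \in [W]} \sum_{y^n \in T_V(i)} \Pr[y^n \mid M = i] - P_e\right)$$

$$\overset{(b)}{\ge} e^{n(R^{(n)} - (C + \epsilon_n))} \left(\frac{1}{2} - \frac{|\mathcal{X}||\mathcal{Y}|}{8\sqrt{n}} - P_e\right)$$

where (a) follows from equation (13) and (b) follows from Chebyshev's inequality (see footnote 11).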



B. Proof of Theorem 2

1) Achievability: $E_{md} \ge C$:
Proof:

For each block length $n$, the special message is sent with the length $n$ repetition sequence $x^n(1) = (x_r, x_r, \ldots, x_r)$
where $x_r$ is the input letter satisfying

$$D\left(P^*_Y(\cdot) \| W_{Y|X}(\cdot|x_r)\right) = \max_{i \in \mathcal{X}} D\left(P^*_Y(\cdot) \| W_{Y|X}(\cdot|i)\right).$$

The remaining $|\mathcal{M}| - 1$ ordinary codewords are generated randomly and independently of each other using the
capacity achieving input distribution $P^*_X$ i.i.d. over time.

Let us denote the empirical distribution of a particular output sequence $y^n$ by $Q_{(y^n)}$. The receiver decodes to
the special message only when the output distribution is not close to $P^*_Y$. Being more precise, the set of output
distributions close to $P^*_Y$, $[P^*_Y]$, and the decoding region of the special message, $G(1)$, are given as follows,

$$[P^*_Y] = \{P_Y(\cdot) : |P_Y(i) - P^*_Y(i)| \le \sqrt[4]{1/n} \ \ \forall i \in \mathcal{Y}\}, \qquad G(1) = \{y^n : Q_{(y^n)} \notin [P^*_Y]\}.$$

Since there are at most $(n+1)^{|\mathcal{Y}|}$ different empirical output distributions for elements of $\mathcal{Y}^n$ we get

$$\Pr[y^n \notin G(1) \mid M = 1] \le (n+1)^{|\mathcal{Y}|}\, e^{-n \min_{Q_Y \in [P^*_Y]} D(Q_Y(\cdot) \| W_{Y|X}(\cdot|x_r))}.$$

Thus $\lim_{n \to \infty} -\frac{1}{n} \ln \Pr[y^n \notin G(1) \mid M = 1] = D\left(P^*_Y(\cdot) \| W_{Y|X}(\cdot|x_r)\right) = C$.
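The special-message decoder described above is a simple test on the output type. The sketch below picks the repetition letter $x_r$ and applies the type test, assuming an $n^{-1/4}$ radius for the neighborhood of $P^*_Y$ (an assumption chosen to match the Chebyshev argument used for the ordinary messages). The function names are hypothetical, and the capacity achieving output distribution is passed in by the caller rather than computed.

```python
import math
from collections import Counter

def kl(p, q):
    """D(p || q) in nats for distributions given as lists; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def repetition_letter(p_y_star, w):
    """Input letter x_r maximizing D(P*_Y || W(.|x)); w[x][y] = W(y|x)."""
    return max(range(len(w)), key=lambda x: kl(p_y_star, w[x]))

def decodes_to_special(y, p_y_star):
    """True when the output type is NOT within n^(-1/4) of P*_Y in every coordinate."""
    n = len(y)
    counts = Counter(y)
    radius = n ** -0.25  # assumed neighborhood radius
    return any(abs(counts.get(k, 0) / n - p_y_star[k]) > radius
               for k in range(len(p_y_star)))

# Illustrative example (an assumption): BSC with crossover 0.1, uniform P*_Y.
bsc = [[0.9, 0.1], [0.1, 0.9]]
p_y_star = [0.5, 0.5]
x_r = repetition_letter(p_y_star, bsc)
flagged = decodes_to_special([0] * 100, p_y_star)     # very atypical output
ordinary = decodes_to_special([0, 1] * 50, p_y_star)  # output type equals P*_Y
```

The design choice is that ordinary codewords produce outputs whose type concentrates around $P^*_Y$, so the region near $P^*_Y$ is reserved for them and everything else is given to the special message.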

n— »oo 

Now the only thing left to prove is that the error probability of the remaining messages can be made low enough.
To do that we first calculate the average error probability of the following random code ensemble.

Entries of the codebook, other than the ones corresponding to the special message, are generated independently
using a capacity achieving input distribution $P^*_X$. Because of the symmetry, the average error probability is the same for
all $i \neq 1$ in $\mathcal{M}$. Let us calculate the error probability of the message $M = 2$.

Assuming that the second message was transmitted, $\Pr[y^n \in G(1) \mid M = 2]$ is vanishingly small. This is because
the output distribution for the random ensemble for ordinary codewords is i.i.d. $P^*_Y$. Chebyshev's inequality
guarantees that the probability of the output type being outside a $\sqrt[4]{1/n}$ ball around $P^*_Y$, i.e. outside $[P^*_Y]$, is of the order
$1/\sqrt{n}$.

11 The claim in (b) is identical to the one in [13, Remark on page 34].

Assuming that the second message was transmitted, $\Pr[y^n \in \bigcup_{i > 2} G(i) \mid M = 2]$ is vanishingly small due to
the standard random coding argument for achieving capacity [35].

Thus for any $P_e > 0$, for all large enough $n$ the average error probability of the code ensemble is smaller than $P_e$;
thus we have at least one code with that $P_e$. For that code at least half of the codewords have an error probability
less than $2P_e$. •

2) Converse: $E_{md} \le C$: In Section VIII-D.2 we will prove that even with feedback and variable decoding
time, the missed-detection exponent of a single special message is at most $C$. Thus $E_{md} \le C$.



C. Proof of Theorem 3

1) Achievability: $E_{md} \ge E(r)$:
Proof:

Special codewords: At any given block length $n$, we start with an optimum codebook (say $\mathcal{C}_{special}$) for $\lceil e^{nr} \rceil$
messages. Such an optimum codebook achieves error exponent $E(r)$ for every message in it:

$$\Pr[\hat{M} \neq i \mid M = i] \doteq e^{-nE(r)} \qquad \forall i \in \mathcal{M}_s = \{1, 2, \ldots, \lceil e^{nr} \rceil\}$$

Since there are at most $(n+1)^{|\mathcal{X}|}$ different types, there is at least one type $T_{P_X}$ which has $\frac{\lceil e^{nr} \rceil}{(n+1)^{|\mathcal{X}|}}$ or more
codewords. Throw away all other codewords from $\mathcal{C}_{special}$ and let us call the remaining fixed composition codebook
$\mathcal{C}'_{special}$. Codebook $\mathcal{C}'_{special}$ is used for transmitting the special messages.

As shown in Fig. Ha), let the noise ball around the codeword for the special message i be Bi. These balls 
need not be disjoint. Let B denote the union of these balls of all special messages. 

$$B = \bigcup_i B_i$$

If the output sequence $y^n \in B$, the first stage of the decoder decides a special message was transmitted. The
second stage then chooses the ML candidate amongst the messages in $\mathcal{M}_s$.
Let us define $B_i$ precisely now.

$$B_i = \{y^n : V(y^n, i) \in \mathcal{W}(r + \epsilon, P_X)\}$$

where $\mathcal{W}(r + \epsilon, P_X) = \{V_{Y|X} : D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X) \le E_{sp}(r + \epsilon; P_X)\}$. Recall that the sphere-packing
exponent for input type $P_X$ at rate $r$, $E_{sp}(r; P_X)$, is given by

$$E_{sp}(r; P_X) = \min_{V_{Y|X} :\, I(P_X, V_{Y|X}) \le r} D(V_{Y|X}(\cdot|X) \| W_{Y|X}(\cdot|X) | P_X)$$
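For intuition about the sphere-packing exponent, the sketch below evaluates it in the special case of a binary symmetric channel with uniform input type. It assumes the standard symmetry argument that the minimizing $V_{Y|X}$ can then be taken to be a BSC with crossover $q \ge p$ satisfying $\ln 2 - h(q) = r$, so that the exponent is the binary divergence $d(q \| p)$. This is an illustrative special case with hypothetical function names, not the general optimization.

```python
import math

LN2 = math.log(2)

def h(q):
    """Binary entropy in nats, for 0 < q < 1."""
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def d(q, p):
    """Binary KL divergence d(q || p) in nats, for 0 < q, p < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def esp_bsc(r, p, tol=1e-12):
    """Sphere-packing exponent of a BSC(p), p < 1/2, uniform input, rate r in nats."""
    if r >= LN2 - h(p):
        return 0.0  # at or above capacity the exponent is zero
    lo, hi = p, 0.5
    # ln2 - h(q) decreases from capacity to 0 as q goes from p to 1/2,
    # so bisection finds the crossover q whose mutual information equals r.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if LN2 - h(mid) > r:
            lo = mid
        else:
            hi = mid
    return d(0.5 * (lo + hi), p)

p = 0.1
capacity = LN2 - h(p)
exponents = [esp_bsc(r, p) for r in (0.0, 0.1, 0.2, capacity)]
```

The exponent decreases with the rate and vanishes at capacity, which is why enlarging the rate from $r$ to $r + \epsilon$ in the definition of $B_i$ costs only a small amount of exponent.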

Ordinary codewords: The ordinary codewords are generated randomly using a capacity achieving input distri- 
bution P x . This is the same as Shannon's construction for achieving capacity. The random coding construction 
provides a simple way to show that in the cavity B c (complement of B), we can essentially fit enough typical 
noise-balls to achieve capacity. This avoids the complicated task of carefully choosing the ordinary codewords 
and their decoding regions in the cavity, B c . 

If the output sequence $y^n \in B^c$, the first stage of the decoder decides an ordinary message was transmitted.
The second stage then chooses the ML candidate from ordinary codewords.

Error analysis: First, consider the case when a special codeword $x^n(i)$ is transmitted. By Stein's lemma and
definition of $B_i$, the probability of $y^n \notin B_i$ has exponent $E_{sp}(r + \epsilon; P_X)$. Hence the first stage error exponent is
at least $E_{sp}(r + \epsilon; P_X)$.

Assuming correct first stage decoding, the second stage error exponent for special messages equals $E(r)$.
Hence the effective error exponent for special messages is

$$\min\{E(r), E_{sp}(r + \epsilon; P_X)\}$$

Since $E(r)$ is at most the sphere-packing exponent $E_{sp}(r; P_X)$ [19], choosing arbitrarily small $\epsilon$ ensures that the
missed-detection exponent of each special message equals $E(r)$.

Now consider the situation of a uniformly chosen ordinary codeword being transmitted. We have to make sure
that the error probability is vanishingly small now. In this case, the output sequence distribution is i.i.d. $P^*_Y$ for
the random coding ensemble. The first stage decoding error happens when $y^n \in \bigcup_i B_i$. Again by Stein's lemma,
this exponent for any particular $B_i$ equals $E_o$:

$$E_o = \min_{V_{Y|X} \in \mathcal{W}(r + \epsilon, P_X)} D(V_{Y|X}(\cdot|X) \| P^*_Y(\cdot) | P_X)$$

$$\overset{(a)}{=} \min_{V_{Y|X} \in \mathcal{W}(r + \epsilon, P_X)} I(P_X, V_{Y|X}) + D((PV)_Y(\cdot) \| P^*_Y(\cdot))$$

$$\overset{(b)}{\ge} \min_{V_{Y|X} \in \mathcal{W}(r + \epsilon, P_X)} I(P_X, V_{Y|X})$$

$$\overset{(c)}{\ge} r + \epsilon$$

where $(PV)_Y$ in (a) is given by $(PV)_Y(j) = \sum_i P_X(i) V_{Y|X}(j|i)$, (b) follows from the non-negativity of the
KL divergence and (c) follows from the definition of the sphere-packing exponent and $\mathcal{W}(r + \epsilon, P_X)$.

Applying the union bound over the special messages, the probability of first stage decoding error after sending
an ordinary message is at most $\doteq \exp(nr - nE_o)$. We have already shown that $E_o \ge r + \epsilon$, which ensures that the
probability of first stage decoding error for ordinary messages is at most $\doteq e^{-n\epsilon}$ for the random coding ensemble.
Recall that for the random coding ensemble, the average error probability of the second-stage decoding also vanishes
below capacity. To summarize, we have shown these two properties of the random coding ensemble:

1) Error probability of first stage decoding vanishes as $a^{(n)} \doteq \exp(-n\epsilon)$ with $n$ when a uniformly chosen
ordinary message is transmitted.

2) Error probability of second stage decoding (say $b^{(n)}$) vanishes with $n$ when a uniformly chosen ordinary
message is transmitted.

Since the first error probability is at most $4a^{(n)}$ for some 3/4 fraction of codes in the random ensemble, and the
second error probability is at most $4b^{(n)}$ for some 3/4 fraction, there exists a particular code which satisfies both
these properties. The overall error probability for ordinary messages is at most $4(a^{(n)} + b^{(n)})$, which vanishes with
$n$. We will use this particular code for the ordinary codewords. This de-randomization completes our construction
of a reliable code for ordinary messages to be combined with the code $\mathcal{C}'_{special}$ for special messages. •
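The de-randomization step is an application of Markov's inequality to two nonnegative quantities over the same ensemble: fewer than a quarter of the codes can exceed four times the average of either quantity, so more than half the codes are good on both. A small numerical sketch, where the ensemble of error probabilities is replaced by hypothetical random nonnegative numbers:

```python
import random

def mean(values):
    return sum(values) / len(values)

def good_fraction(values, factor=4):
    """Fraction of entries that are at most factor times the ensemble average."""
    m = mean(values)
    return sum(v <= factor * m for v in values) / len(values)

# Hypothetical ensemble (an assumption): per-code first-stage and second-stage
# error values drawn at random; any nonnegative numbers would do.
rng = random.Random(0)
first_stage = [rng.expovariate(1.0) for _ in range(1000)]
second_stage = [rng.expovariate(1.0) for _ in range(1000)]

a = mean(first_stage)
b = mean(second_stage)
both_good = [i for i in range(1000)
             if first_stage[i] <= 4 * a and second_stage[i] <= 4 * b]
```

Any index in `both_good` corresponds to a code whose total error is at most $4(a + b)$, mirroring the bound $4(a^{(n)} + b^{(n)})$ in the text.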

2) Converse: $E_{md} \le E(r)$: The converse argument for this result is immediate. Removing the ordinary messages
from the code can only improve the error probability of the special messages. Even then, (by definition) the best
missed detection exponent for the special messages equals $E(r)$.



D. Proof of Theorem [?] 

Let us now address the case with erasures. In this achievability result, the first stage of decoding remains 
unchanged from the no-erasure case. 
Proof: 

We use essentially the same strategy as before. Let us start with a good code for $\lceil e^{nr} \rceil$ messages allowing erasure
decoding. Forney showed in [18] that, for symmetric channels, an error exponent equal to $E_{sp}(r) + C - r$
is achievable while ensuring that the erasure probability vanishes with $n$. We can use that code for these $\lceil e^{nr} \rceil$
codewords. As before, for $y^n \in \bigcup_i B_i$, the first stage decides a special codeword was sent. Then the second stage
applies the erasure decoding method in [18] amongst the special codewords.

With this decoding rule, when a special message is transmitted, the error probability of the two-stage decoding is
bottlenecked by the first stage: its error exponent $E_{sp}(r + \epsilon)$ is smaller than that of the second stage ($E_{sp}(r) + C - r$).
By choosing arbitrarily small $\epsilon$, the special messages can achieve $E_{sp}(r)$ as their missed-detection exponent.

The ordinary codewords are again generated i.i.d. $P^*_X$. If the first stage decides in favor of the ordinary
messages, ML decoding is implemented among ordinary codewords. If an ordinary message was transmitted, we
can ensure a vanishing error probability as before by repeating the earlier arguments for the no-erasure case. •



VIII. Variable Length Block Codes with Feedback: Proofs 

In this section we will present a more detailed discussion of bit-wise and message-wise UEP for variable
length block codes with feedback by proving Theorems 5, 6, 7, 8 and 9. In the proofs of converse results we
need to discuss issues related to the conditional entropy of the messages given the observation of the receiver.
In those discussions we use the following notation for conditional entropy and conditional mutual information,

$$\mathcal{H}(M|Y^n) = -\sum_{i \in \mathcal{M}} \Pr[M = i \mid Y^n] \ln \Pr[M = i \mid Y^n]$$

$$\mathcal{I}(M; Y_{n+1}|Y^n) = \mathcal{H}(M|Y^n) - E\left[\mathcal{H}(M|Y^{n+1}) \mid Y^n\right].$$
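In this notation the conditional entropy is a random variable, a function of the realized observation, rather than an average over observations. The sketch below computes the per-realization posterior entropy and, separately, its expectation, which is the conventional conditional entropy. The function names and the Z-channel-like likelihood table are illustrative assumptions.

```python
import math

def posterior_entropy(prior, likelihood, y):
    """Entropy (nats) of the posterior of M given the realization Y = y.

    likelihood[i][y] = Pr[Y = y | M = i]. Also returns Pr[Y = y].
    """
    joint = [prior[i] * likelihood[i][y] for i in range(len(prior))]
    p_y = sum(joint)
    posterior = [j / p_y for j in joint]
    ent = -sum(p * math.log(p) for p in posterior if p > 0)
    return ent, p_y

def averaged_conditional_entropy(prior, likelihood):
    """Conventional H(M|Y): the expectation of the posterior entropy over Y."""
    total = 0.0
    for y in range(len(likelihood[0])):
        ent, p_y = posterior_entropy(prior, likelihood, y)
        total += p_y * ent
    return total

# Illustrative example (an assumption): two messages, Z-channel-like observation.
prior = [0.5, 0.5]
likelihood = [[1.0, 0.0],   # message 0 always produces y = 0
              [0.5, 0.5]]   # message 1 produces y = 0 or y = 1 equally often
h_given_y0, p_y0 = posterior_entropy(prior, likelihood, 0)
h_given_y1, p_y1 = posterior_entropy(prior, likelihood, 1)
h_avg = averaged_conditional_entropy(prior, likelihood)
```

Observing `y = 1` resolves the message completely, while observing `y = 0` leaves residual uncertainty; the averaged quantity hides this difference, which is why the realization-dependent version is the one needed when decoding times depend on the observations.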

It is worth noting that this notation is different from the widely used one, which includes a further expectation
over the conditioned variable. "$H(M|Y^n)$" in the conventional notation stands for $E[\mathcal{H}(M|Y^n)]$, and
"$H(M|Y^n = y^n)$" stands for $\mathcal{H}(M|Y^n)$.



A. Proof of Theorem 5
1) Achievability: $E^f_b \ge C$:

This single special bit exponent is achieved using the missed detection exponent of a single special message,
indicating a decoding error for the special bit. The decoding error for the bit goes unnoticed when this special
message is not detected. This shows how feedback connects bit-wise UEP to message-wise UEP in a fundamental
manner.
Proof:

We will prove that $E^f_b \ge C$ by constructing a capacity achieving sequence with feedback, $Q$, such that $E^f_{b,Q} = C$.
For that let $Q'$ be a capacity achieving sequence such that $E_{md,Q'} = C$. Note that existence of such a $Q'$ is
guaranteed as a result of Theorem 2. We first construct a two phase fixed length block code with feedback and
erasures. Then using this we obtain the $k$-th element of $Q$.

In the first phase one of the two input symbols, $x_0$ and $x_1$, with distinct output distributions (see footnote 12), is sent for
$\lceil \sqrt{k} \rceil$ time units depending on $M_1$. At time $\lceil \sqrt{k} \rceil$ the receiver makes a tentative decision $\tilde{M}_1$ on message $M_1$. Using
the Chernoff bound it can easily be shown that [36, Theorem 5]

$$\Pr[\tilde{M}_1 \neq M_1] \le e^{-\mu \sqrt{k}} \qquad \text{where } \mu > 0.$$

The actual value of $\mu$, however, is immaterial to us; we are merely interested in finding an upper bound on $\Pr[\tilde{M}_1 \neq M_1]$
which goes to zero as $k$ increases.

In the second phase the transmitter uses the $k$-th member of $Q'$. The message in the second phase, $M'$, is determined
by $M_2$ depending on whether $M_1$ is decoded correctly or not at the end of the first phase.

$$\tilde{M}_1 \neq M_1 \ \Rightarrow \ M' = 1$$
$$\tilde{M}_1 = M_1 \text{ and } M_2 = i \ \Rightarrow \ M' = i + 1 \quad \forall i$$

12 Two input symbols $x_0$ and $x_1$ are such that $W_{Y|X}(\cdot|x_1) \neq W_{Y|X}(\cdot|x_0)$.
At the end of the second phase the decoder decodes $M'$ using the decoder of $Q'$. If the decoded message is one, i.e.
$\hat{M}' = 1$, then the receiver declares an erasure, else $\hat{M}_1 = \tilde{M}_1$ and $\hat{M}_2 = \hat{M}' - 1$.

Note that the erasure probability of the two phase fixed length block code is upper bounded as

$$\Pr[\hat{M}' = 1] \le \Pr[\tilde{M}_1 \neq M_1] + \Pr[\hat{M}' = 1 \mid M' \neq 1] \le e^{-\mu \sqrt{k}} + \frac{|\mathcal{M}'^{(k)}|}{|\mathcal{M}'^{(k)}| - 1} P_e'^{(k)} \quad (17)$$

where $P_e'^{(k)}$ is the error probability of the $k$-th member of $Q'$.

Similarly we can upper bound the probabilities of two error events associated with the two phase fixed length
block code as follows

$$\Pr[\hat{M}_1 \neq M_1,\ \hat{M}' \neq 1] \le P_e'^{(k)}(1) \quad (18)$$

$$\Pr[\hat{M} \neq M,\ \hat{M}' \neq 1] \le \frac{|\mathcal{M}'^{(k)}|}{|\mathcal{M}'^{(k)}| - 1} P_e'^{(k)} + P_e'^{(k)}(1) \quad (19)$$

where $P_e'^{(k)}(1)$ is the conditional error probability of the 1st message in the $k$-th element of $Q'$.

If there is an erasure the transmitter and the receiver will repeat what they have done again, until they get
$\hat{M}' \neq 1$. If we sum the probabilities of all the error events, including error events in the possible repetitions, we
get

$$\Pr[\hat{M}_1 \neq M_1] = \frac{\Pr[\hat{M}_1 \neq M_1,\ \hat{M}' \neq 1]}{1 - \Pr[\hat{M}' = 1]} \quad (20)$$

$$\Pr[\hat{M} \neq M] = \frac{\Pr[\hat{M} \neq M,\ \hat{M}' \neq 1]}{1 - \Pr[\hat{M}' = 1]} \quad (21)$$

Note that the expected decoding time of the code is

$$E[\tau] = \frac{k + \lceil \sqrt{k} \rceil}{1 - \Pr[\hat{M}' = 1]} \quad (22)$$

Using equations (17), (18), (19), (20), (21) and (22) one can conclude that the resulting sequence of variable
length block codes with feedback, $Q$, is reliable. Furthermore $R_Q = C$ and $E^f_{b,Q} = C$.
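The repeat-until-no-erasure step turns per-round probabilities into overall ones through a geometric series. The sketch below evaluates the closed forms for the overall error probabilities and the expected decoding time, and cross-checks one of them against a truncated series. The function name and the numerical inputs are illustrative assumptions.

```python
import math

def repeat_until_accepted(k, p_erasure, p_bit_err, p_msg_err):
    """Overall error probabilities and expected time of the repeated two-phase code.

    p_erasure : per-round probability of declaring an erasure
    p_bit_err : per-round probability of (special-bit error and no erasure)
    p_msg_err : per-round probability of (message error and no erasure)
    Each round lasts k + ceil(sqrt(k)) channel uses.
    """
    accept = 1.0 - p_erasure
    round_length = k + math.ceil(math.sqrt(k))
    return {
        "bit_error": p_bit_err / accept,
        "message_error": p_msg_err / accept,
        "expected_time": round_length / accept,
    }

# Illustrative numbers (assumptions), not values derived in the paper.
result = repeat_until_accepted(k=100, p_erasure=0.2, p_bit_err=0.001, p_msg_err=0.01)

# Cross-check: sum over rounds of Pr[first t rounds erased] * Pr[error in next round].
series = sum(0.2 ** t * 0.001 for t in range(200))
```

Because the erasure probability vanishes with $k$, the division by the acceptance probability changes neither the rate nor the exponent asymptotically.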

2) Converse: $E^f_b \le C$:

We will use a converse result we have not proved yet, namely the converse part of Theorem 8, i.e. $E^f_{md} \le C$.
Proof:

Consider a capacity achieving sequence, $Q$, with message set sequence $\mathcal{M}^{(k)} = \{0, 1\} \times \mathcal{M}_2^{(k)}$. Using $Q$ we
construct another capacity achieving sequence $Q'$ with a special message 0, with message set sequence $\mathcal{M}'^{(k)} =
\{0\} \cup \mathcal{M}_2^{(k)}$, such that $E^f_{md,Q'} = E^f_{b,Q}$. This implies $E^f_b \le E^f_{md}$, which together with Theorem 8, $E^f_{md} \le C$, gives
us $E^f_b \le C$.

Let us denote the message of $Q$ by $M$ and that of $Q'$ by $M'$. The $k$-th code of $Q'$ is as follows. At time 0 the receiver
chooses randomly an $M_1$ for the $k$-th element of $Q$ and sends its choice through the feedback channel to the transmitter. If
the message of $Q'$ is not 0, i.e. $M' \neq 0$, then the transmitter uses the codeword for $M = (M_1, M')$ to convey $M'$.
If $M' = 0$ the transmitter picks an $M_2$ with uniform distribution on $\mathcal{M}_2$ and uses the codeword for $M = (1 - M_1, M_2)$
to convey that $M' = 0$.

The receiver decodes using the decoder of $Q$: if $\hat{M} = (M_1, i)$ then $\hat{M}' = i$, if $\hat{M} = (1 - M_1, i)$ then
$\hat{M}' = 0$. One can easily show that the expected decoding time and error probability of both of the codes are the same.
Furthermore the error probability of $M_1$ in $Q$ is equal to the conditional error probability of message $M' = 0$ in $Q'$, thus
$E^f_{md,Q'} = E^f_{b,Q}$.
B. Proof of Theorem 6

1) Achievability: $E^f_{bits}(r) \ge \left(1 - \frac{r}{C}\right) C$:
Proof:

We will construct the capacity achieving sequence with feedback $Q$ using a capacity achieving sequence $Q'$
satisfying $E_{md,Q'} = C$, as we did in the proof of Theorem 5. We know that such a sequence exists, because of
Theorem 8.

For the $k$-th member of $Q$, consider the following two phase errors and erasures code. In the first phase the transmitter
uses the $\lfloor \frac{r}{C} k \rfloor$-th element of $Q'$ to convey $M_1$. The receiver makes a tentative decision $\tilde{M}_1$. In the second phase the
transmitter uses the $\lfloor (1 - \frac{r}{C}) k \rfloor$-th element of $Q'$ to convey $M_2$ and whether $\tilde{M}_1 = M_1$ or not, with a mapping
similar to the one we had in the proof of Theorem 5.

$$\tilde{M}_1 \neq M_1 \ \Rightarrow \ M' = 1$$
$$\tilde{M}_1 = M_1 \text{ and } M_2 = i \ \Rightarrow \ M' = i + 1 \quad \forall i$$

Thus $\mathcal{M}_1^{(k)} = \mathcal{M}'^{(\lfloor \frac{r}{C} k \rfloor)}$ and $\mathcal{M}_2^{(k)} \cup \{|\mathcal{M}_2^{(k)}| + 1\} = \mathcal{M}'^{(\lfloor (1 - \frac{r}{C}) k \rfloor)}$. If we apply a decoding algorithm like the
one we had in the proof of Theorem 5, going through essentially the same analysis as in the proof of Theorem 5 we
can conclude that $Q$ is a capacity achieving sequence with $E^f_{bits,Q} = \left(1 - \frac{r}{C}\right) C$ and $r_Q = r$. •

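2) Converse: $E^f_{bits}(r) \le \left(1 - \frac{r}{C}\right) C$:

In establishing the converse we will use a technique that was used previously in [4], together with Lemma 1,
which we will prove in the converse part of Theorem 8.

Proof:

Consider any variable length block code with feedback whose message set $\mathcal{M}$ is of the form $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2$.
Let $\tilde{\tau}_\delta$ be the first time instance that an $i \in \mathcal{M}_1$ becomes more likely than $(1 - \delta)$ and let $\tau_\delta = \tilde{\tau}_\delta \wedge \tau$.

Recall that $\min_{i,j} W_{Y|X}(j|i) = \lambda$; consequently the definition of $\tau_\delta$ implies that $\min_i \left(1 - \Pr[M_1 = i \mid Y^{\tau_\delta}]\right) \ge \lambda\delta$.
Thus using Markov inequality for $P_e$ we get

$$\Pr[\tau_\delta = \tau] \le \frac{P_e}{\lambda\delta} \quad (23)$$

We use equation (23) to bound the expected value of the entropy of the first part of the message at time $\tau_\delta$ as follows,

$$E[\mathcal{H}(M_1|Y^{\tau_\delta})] = E[\mathcal{H}(M_1|Y^{\tau_\delta}) \mathbb{1}_{\{\tau_\delta = \tau\}}] + E[\mathcal{H}(M_1|Y^{\tau_\delta}) \mathbb{1}_{\{\tau_\delta < \tau\}}]$$

$$\le \frac{P_e}{\lambda\delta} \ln|\mathcal{M}_1| + \left(\ln 2 + \delta \ln|\mathcal{M}_1|\right)$$

$$= \ln 2 + \left(\frac{P_e}{\lambda\delta} + \delta\right) \ln|\mathcal{M}_1|$$

It has already been established in [4] that

$$\frac{E[\mathcal{H}(M) - \mathcal{H}(M|Y^{\tau_\delta})]}{E[\tau_\delta]} \le C \quad (24)$$

Thus,

$$E[\tau_\delta] \ge \frac{1}{C} E\left[\mathcal{H}(M) - \mathcal{H}(M_1|Y^{\tau_\delta}) - \mathcal{H}(M_2|M_1, Y^{\tau_\delta})\right]$$

$$\ge \frac{1}{C}\left(-\ln 2 + \left(1 - \frac{P_e}{\lambda\delta} - \delta\right) \ln|\mathcal{M}_1|\right) \quad (25)$$

The bound given in inequality (25) specifies the time needed for getting a likely candidate, $\hat{M}_1$. As was the case
in [4], the remaining time is the time spent for confirmation. But unlike [4], the transmitter also needs to convey $M_2$
during this time.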

For each realization of $Y^{\tau_\delta}$ divide the message set into disjoint subsets $\Theta_0, \Theta_1, \ldots, \Theta_{|\mathcal{M}_2|}$ as follows,

$$\Theta_0 = \{l : l \in \mathcal{M},\ l = (i, j) \text{ where } i \neq \hat{M}_1(Y^{\tau_\delta})\}$$

$$\Theta_j = \{l : l \in \mathcal{M},\ l = (\hat{M}_1(Y^{\tau_\delta}), j)\} \qquad \forall j \in \{1, 2, \ldots, |\mathcal{M}_2|\}$$

where $\hat{M}_1(Y^{\tau_\delta})$ is the most likely message given $Y^{\tau_\delta}$. Furthermore let the auxiliary message, $M'$, be the index
of the set that $M$ belongs to, i.e. $M \in \Theta_{M'}$.

The decoder for the auxiliary message decodes the index of the decoded message at the decoding time $\tau$, i.e.

$$\hat{M}'(Y^\tau) = j \ \Leftrightarrow \ \hat{M}(Y^\tau) \in \Theta_j.$$
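The partition into $\Theta_0, \Theta_1, \ldots$ can be read as a simple relabeling of messages once the tentative value of the first part is fixed. A minimal sketch, with hypothetical function names and a toy message set, where the second part of the message is indexed from 1:

```python
def auxiliary_index(message, tentative_m1):
    """Index of the subset containing message = (m1, m2).

    Messages whose first part disagrees with the tentative decision all fall
    in subset 0; otherwise the index is the second part m2 (1-based).
    """
    m1, m2 = message
    return m2 if m1 == tentative_m1 else 0

def partition(m1_values, m2_size, tentative_m1):
    """Build the subsets Theta_0, ..., Theta_{m2_size} as lists of messages."""
    theta = {j: [] for j in range(m2_size + 1)}
    for m1 in m1_values:
        for m2 in range(1, m2_size + 1):
            theta[auxiliary_index((m1, m2), tentative_m1)].append((m1, m2))
    return theta

# Toy example (an assumption): three values for the first part, two for the second.
theta = partition(m1_values=range(3), m2_size=2, tentative_m1=1)
```

The auxiliary decoder applies the same `auxiliary_index` map to the finally decoded message, so an auxiliary decoding error can occur only if the original message is decoded incorrectly.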



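With these definitions we have

$$\Pr[\hat{M}(Y^\tau) \neq M \mid Y^{\tau_\delta}] \ge \Pr[\hat{M}'(Y^\tau) \neq M' \mid Y^{\tau_\delta}]$$

$$\Pr[\hat{M}_1(Y^\tau) \neq M_1 \mid Y^{\tau_\delta}] \ge \Pr[\hat{M}'(Y^\tau) \neq 0 \mid Y^{\tau_\delta}, M' = 0]\, \Pr[M' = 0 \mid Y^{\tau_\delta}]$$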



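Now, we apply Lemma 1, which will be proved in Section VIII-D.2. To ease the notation we use the following
shorthand:

$$P_e^{M'}\{Y^{\tau_\delta}\} = \Pr[\hat{M}'(Y^\tau) \neq M' \mid Y^{\tau_\delta}]$$

$$P_e^{M'}\{0, Y^{\tau_\delta}\} = \Pr[\hat{M}'(Y^\tau) \neq 0 \mid Y^{\tau_\delta}, M' = 0]$$

$$\xi(Y^{\tau_\delta}) = \Pr[M' = 0 \mid Y^{\tau_\delta}].$$

As a result of Lemma 1, for each realization $y^{\tau_\delta}$ of $Y^{\tau_\delta}$ such that $\tau_\delta < \tau$, we have

$$\left(1 - \xi(Y^{\tau_\delta}) - P_e^{M'}\{Y^{\tau_\delta}\}\right) \ln \frac{1}{P_e^{M'}\{0, Y^{\tau_\delta}\}} \le \ln 2 + E[\tau - \tau_\delta \mid Y^{\tau_\delta}]\, J\!\left(\frac{\mathcal{H}(M'|Y^{\tau_\delta}) - \ln 2 - P_e^{M'}\{Y^{\tau_\delta}\} \ln|\mathcal{M}_2|}{E[\tau - \tau_\delta \mid Y^{\tau_\delta}]}\right)$$

By multiplying both sides of the inequality with $\mathbb{1}_{\{\tau_\delta < \tau\}}$, we get an expression that holds for all $Y^{\tau_\delta}$:

$$\mathbb{1}_{\{\tau_\delta < \tau\}} \left(1 - \xi(Y^{\tau_\delta}) - P_e^{M'}\{Y^{\tau_\delta}\}\right) \ln \frac{1}{P_e^{M'}\{0, Y^{\tau_\delta}\}} \le \mathbb{1}_{\{\tau_\delta < \tau\}} \left(\ln 2 + E[\tau - \tau_\delta \mid Y^{\tau_\delta}]\, J\!\left(\frac{\mathcal{H}(M'|Y^{\tau_\delta}) - \ln 2 - P_e^{M'}\{Y^{\tau_\delta}\} \ln|\mathcal{M}_2|}{E[\tau - \tau_\delta \mid Y^{\tau_\delta}]}\right)\right) \quad (26)$$

Now we take the expectation of both sides over $Y^{\tau_\delta}$. For the right hand side we have,

$$\text{R.H.S.} = E\left[\mathbb{1}_{\{\tau_\delta < \tau\}} \left(\ln 2 + E[\tau - \tau_\delta \mid Y^{\tau_\delta}]\, J\!\left(\frac{\mathcal{H}(M'|Y^{\tau_\delta}) - \ln 2 - P_e^{M'}\{Y^{\tau_\delta}\} \ln|\mathcal{M}_2|}{E[\tau - \tau_\delta \mid Y^{\tau_\delta}]}\right)\right)\right]$$

$$\le \ln 2 + E\left[\mathbb{1}_{\{\tau_\delta < \tau\}}\, E[\tau - \tau_\delta \mid Y^{\tau_\delta}]\, J\!\left(\frac{\mathcal{H}(M'|Y^{\tau_\delta}) - \ln 2 - P_e^{M'}\{Y^{\tau_\delta}\} \ln|\mathcal{M}_2|}{E[\tau - \tau_\delta \mid Y^{\tau_\delta}]}\right)\right]$$

$$\overset{(a)}{\le} \ln 2 + E[\tau - \tau_\delta]\, J\!\left(\frac{E\left[\mathbb{1}_{\{\tau_\delta < \tau\}} \left(\mathcal{H}(M'|Y^{\tau_\delta}) - \ln 2 - P_e^{M'}\{Y^{\tau_\delta}\} \ln|\mathcal{M}_2|\right)\right]}{E[\tau - \tau_\delta]}\right)$$

$$\overset{(b)}{\le} \ln 2 + E[\tau - \tau_\delta]\, J\!\left(\frac{E\left[\mathbb{1}_{\{\tau_\delta < \tau\}} \mathcal{H}(M'|Y^{\tau_\delta})\right] - \ln 2 - P_e \ln|\mathcal{M}_2|}{E[\tau - \tau_\delta]}\right) \quad (27)$$

where (a) follows from the concavity of $J(\cdot)$ and Jensen's inequality when we interpret $\frac{\mathbb{1}_{\{\tau_\delta < \tau\}} E[\tau - \tau_\delta \mid Y^{\tau_\delta}]}{E[\tau - \tau_\delta]}$ as a
probability distribution over $Y^{\tau_\delta}$, and (b) follows from the fact that $J(\cdot)$ is a decreasing function.
Now we lower bound $E[\mathbb{1}_{\{\tau_\delta < \tau\}} \mathcal{H}(M'|Y^{\tau_\delta})]$ in terms of $E[\mathcal{H}(M|Y^{\tau_\delta})]$. Note that

$$\mathcal{H}(M|Y^{\tau_\delta}) = \mathcal{H}(M'|Y^{\tau_\delta}) + \Pr[M_1 \neq \hat{M}_1(Y^{\tau_\delta}) \mid Y^{\tau_\delta}]\, \mathcal{H}(M \mid M_1 \neq \hat{M}_1(Y^{\tau_\delta}), Y^{\tau_\delta})$$

$$\le \mathcal{H}(M'|Y^{\tau_\delta}) + \Pr[M_1 \neq \hat{M}_1(Y^{\tau_\delta}) \mid Y^{\tau_\delta}] \ln|\mathcal{M}_1||\mathcal{M}_2|$$

Furthermore for all $Y^{\tau_\delta}$ such that $\tau > \tau_\delta$, $\Pr[\hat{M}_1(Y^{\tau_\delta}) \neq M_1 \mid Y^{\tau_\delta}] \le \delta$. Thus

$$E[\mathbb{1}_{\{\tau_\delta < \tau\}} \mathcal{H}(M'|Y^{\tau_\delta})] \ge E\left[\mathbb{1}_{\{\tau_\delta < \tau\}} \left(\mathcal{H}(M|Y^{\tau_\delta}) - \delta \ln|\mathcal{M}_1||\mathcal{M}_2|\right)\right]$$

$$\ge E\left[(1 - \mathbb{1}_{\{\tau_\delta = \tau\}}) \mathcal{H}(M|Y^{\tau_\delta})\right] - \delta \ln|\mathcal{M}_1||\mathcal{M}_2|$$

$$\ge E[\mathcal{H}(M|Y^{\tau_\delta})] - \Pr[\tau_\delta = \tau] \ln|\mathcal{M}_1||\mathcal{M}_2| - \delta \ln|\mathcal{M}_1||\mathcal{M}_2|$$

$$\overset{(a)}{\ge} E[\mathcal{H}(M|Y^{\tau_\delta})] - \left(\frac{P_e}{\lambda\delta} + \delta\right) \ln|\mathcal{M}_1||\mathcal{M}_2|$$

$$\overset{(b)}{\ge} \left(1 - \frac{P_e}{\lambda\delta} - \delta\right) \ln|\mathcal{M}_1||\mathcal{M}_2| - C\, E[\tau_\delta] \quad (28)$$

where (a) follows from inequality (23) and (b) follows from inequality (24). Since $J(\cdot)$ is decreasing in its
argument, inserting (28) in (27) we get

$$\text{R.H.S.} \le \ln 2 + E[\tau - \tau_\delta]\, J\!\left(\frac{\ln|\mathcal{M}_1||\mathcal{M}_2| \left(1 - \frac{P_e}{\lambda\delta} - \delta - P_e\right) - E[\tau_\delta]\, C - \ln 2}{E[\tau - \tau_\delta]}\right) \quad (29)$$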



Note that Va > 0, b > 0, C > 0, 

±(6-x)J-(^f 



_ f [ a—Cxg 
J \ b-x 

(a) 

<-J(c) 



sj a— Cxp \ d 

b-x 



a— Cxp 



where (a) follows the concavity of J (•). Thus upper bound given in equation d29l) is decreasing in E 1 [ts]. Thus 
using the lower bound on E [t$], given in (|23T ) we get, 



M.S. < In 2 + (e [r] - (1 - 6 - ^)^^ + ¥) ^ 



In|X 2 |-P e ln|A< 1 |-21n2 
£[ r ]-(l-<5-A5) C + ~C~ 



(a) 

> £ 



I {T5<T} (1 - £(y*») - Pf'{Y^}) In pp^y 

[i H<T} (i-g(y T 0-p e M, {^})] 



Now let us consider the L.H.S. we get by taking the expectation of the inequality given in 
L.H.S. = E 

[i {T(S<T }(i-5(y^)-^ M '{^})^ M '{o,^}J 
i {TS<T} (i-^(y-)-p e M '{y-})" 
i N<T} (i-^(y-)-p e M '{y-})" 



(6) . 
> -e' 1 + E 



> -e" 1 + E 



In 



In- 



' ih^s ~\ ( l -Z( Y ' s )- p f {Y TS )}Pf {o,y T « }j 



s[i {T6<T} p c M '{o,y^}] 

where (a) follows log sum inequality and (b) follows from the fact that xlnx > — e _1 . 



Note that 



E 



I {TS<T} (1 - C(Y TS ) ~ Pf{Y T *})\ > E [I {TS<T} (1 - " E [Pf{Y^} 

> E [I {TS<T} ] (1-5)- P e 

> 1 

where in last step we have used the equation (T23T ). Furthermore 



Ee _x 
X8 



E 



I {Ts <r}Pf{0,Y^} 



E 



\r s <r) Pr 



Mi = Mi 



y T4 ,Mi + Mi 



<JX E 



hrsKr} Pr 



Mi = M 



y r5 ,Mi / m x 



Pr 



M x / Mi 



y T 



< 



<5A 
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(31) 



(32) 



(33) 
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Thus using equations (PTT ). (l32l ) and (I33T ) we get 

L.tf.S. > -e 

Using the inequalities (f30l >. (l34l) and choosing <5 = 
implies 4 ts (r) < (l - £) C. 

C. Proof of Theorem 7

1) Achievability:
Proof:

The proof is very similar to the achievability proof for Theorem 6. Choose a capacity achieving sequence $Q'$ such
that $E_{md,Q'} = C$. The capacity achieving sequence with feedback, $Q$, uses $L$ elements of $Q'$ as follows.

For the $k$-th element of code $Q$, the transmitter uses the $\lfloor k \cdot \eta_1 \rfloor$-th element of $Q'$ to send the first part of the
message, $M_1$. In the remaining phases, $l \ge 2$, the transmitter uses the $\lfloor k \cdot \eta_l \rfloor$-th element of $Q'$. The special message of
the code for phase $l$ is allocated to the error event in previous phases.

$$(\tilde{M}_1, \ldots, \tilde{M}_{l-1}) \neq (M_1, \ldots, M_{l-1}) \ \Rightarrow \ M'_l = 1 \quad \forall l$$

$$(\tilde{M}_1, \ldots, \tilde{M}_{l-1}) = (M_1, \ldots, M_{l-1}) \ \Rightarrow \ M'_l = M_l + 1 \quad \forall l$$

Thus $\mathcal{M}_1^{(k)} = \mathcal{M}'^{(\lfloor \eta_1 k \rfloor)}$ and for all $l > 1$, $\mathcal{M}_l^{(k)} \cup \{|\mathcal{M}_l^{(k)}| + 1\} = \mathcal{M}'^{(\lfloor \eta_l k \rfloor)}$. If for all $l \in \{2, 3, \ldots, L\}$, $\hat{M}'_l \neq 1$, the
receiver decodes all parts of the information, else it declares an erasure. We skip the error analysis because it is
essentially the same as that of Theorem 6. •

2) Converse: 
Proof: 

We prove the converse of Theorem [7] by contradiction. Evidently 

max{P e Ml , P e M \ . . . , P e M >} < P MuM 2 ,...,Mi < P M, + pM 2 + . . . + p M 3 Vj E {I, 2, . . . L} 

Thus if there exists a scheme that can reach an error exponent vector outside the region given in Theorem |7J 
there is at least one Ei such that > (1 ^ 3 )C. Then we can have two super messages as follows, 

M[ = (Mi, M 2 , . . . , Mi) and M' 2 = (M i+1 , M i+2 , ...,M t ) 

Recall that P e Ml < P e M2 < ■ ■ ■ < P e M ' . Thus this new code is a capacity achieving code, whose special bits 
have rate tq< and £^ its q, > ^bits( r S')- This is contradicting with the Theorem [6] we have already proved. Thus 
all the achievable error exponent regions should lie in the region given in Theorem [7] • 



D. Proof of Theorem 8
1) Achievability: $E^f_{md} \ge C$:

Note that any fixed length block code without feedback is also a variable-length block code with feedback, thus
$E^f_{md} \ge E_{md}$. Using the capacity achieving sequence we have used in the achievability proof of Theorem 2 we
get $E^f_{md} \ge C$.



1- ft 



A<5 



rp e we get E f hits Q <(l-rg)j (C). Since J (C) 



(34) 
Cthis 
2) Converse: $E^f_{md} \le C$:

Now we prove that even with feedback and variable decoding time, the best missed detection exponent of a single special message is less than or equal to $C$, i.e. $E^f_{md} \le C$. Since the set of capacity achieving sequences is a subset of capacity achieving sequences with feedback and variable decoding time, this also implies that $E_{md} \le C$.
Instead of directly proving the converse part of Theorem 8 we first prove the following lemma.
Lemma 1: For any variable length block code with feedback, message set $\mathcal{M}$, initial entropy $H(M)$ and average error probability $P_e$, the conditional error probability of each message is lower bounded as follows,

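$$\Pr[\hat{M} \neq i \mid M = i] \ge \exp\left(-\frac{E[\tau]\, J\!\left(\frac{H(M) - h(P_e) - P_e \ln(|\mathcal{M}| - 1)}{E[\tau]}\right) + \ln 2}{1 - P_e - \Pr[M = i]}\right) \quad \forall i \qquad (35)$$

where $J(R)$ is given by the following optimization over probability distributions on $\mathcal{X}$:

$$J(R) = \max_{\substack{\alpha, x_1, x_2, P_X^1, P_X^2:\\ \alpha I(P_X^1, W_{Y|X}) + (1-\alpha) I(P_X^2, W_{Y|X}) \ge R}} \alpha D\big((P_X^1 W)_Y(\cdot) \,\|\, W_{Y|X}(\cdot|x_1)\big) + (1-\alpha) D\big((P_X^2 W)_Y(\cdot) \,\|\, W_{Y|X}(\cdot|x_2)\big) \qquad (36)$$

It is worthwhile remembering the notation we introduced previously:

$$(P_X^i W)_Y(\cdot) = \sum_j P_X^i(j) W_{Y|X}(\cdot|j) \quad \text{and} \quad I(P_X, W_{Y|X}) = \sum_{j,k} P_X(j) W_{Y|X}(k|j) \ln \frac{W_{Y|X}(k|j)}{(P_X W)_Y(k)}.$$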

The first thing to note about Lemma 1 is that it is not restricted to the case of a uniform probability distribution on the message set $\mathcal{M}$. Furthermore, as long as $\Pr[M = i] \ll 1$, the lower bound on $\Pr[\hat{M} \neq i \mid M = i]$ depends on the a priori probability distribution of the messages only through its entropy, $H(M)$.

In equation (36), $\alpha$ is simply a time sharing variable, which allows us to use an $(x_1, P_X^1)$ pair with low mutual information and high divergence together with another $(x_2, P_X^2)$ pair with high mutual information and low divergence. As a result of Carathéodory's theorem, time sharing between two points of the form $(x_i, P_X^i)$ is sufficient for obtaining optimal performance, i.e. allowing time sharing between more than two points of the form $(x_i, P_X^i)$ will not improve the value of $J(R)$.

Indeed, for any $R \in [0, C]$ one can use the optimizing values of $\alpha$, $x_1$, $x_2$, $P_X^1$ and $P_X^2$ in a scheme like the one in Theorem 2 with time sharing and prove that a missed detection exponent of $J(R)$ is achievable for a reliable sequence of rate $R$. In that scheme $\alpha$ determines how long the input letter $x_1 \in \mathcal{X}$ is used for the special message while $P_X^1$ is being used for the ordinary codewords. Furthermore, arguments very similar to those of Theorem 8 can be used to prove that no missed detection exponent higher than $J(R)$ is achievable for reliable sequences of rate $R$. Thus $J(R)$ is the best exponent a message can get in a rate $R$ reliable sequence.

One can show that $J(R)$ is a concave function of $R$ over its support $[0, C]$. Furthermore $J(0) = D_{max}$ and $J(C) = C$. Thus $J(R)$ is a concave strictly decreasing function of $R$ for $0 \le R \le C$.
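The two endpoint values $J(0) = D_{max}$ and $J(C) = C$ can be evaluated numerically for a concrete channel. The following is a minimal sketch, assuming a binary symmetric channel with a hypothetical crossover probability of 0.1 and strictly positive transition probabilities; the function names are illustrative and not part of the paper.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats; assumes q has no zero entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def d_max(W):
    """Largest divergence between the output distributions of two input letters."""
    n = len(W)
    return max(kl(W[i], W[j]) for i in range(n) for j in range(n))

def mutual_information(P, W):
    """I(P_X, W_{Y|X}) in nats for input distribution P and channel matrix W."""
    out = [sum(P[i] * W[i][y] for i in range(len(P))) for y in range(len(W[0]))]
    return sum(P[i] * kl(W[i], out) for i in range(len(P)) if P[i] > 0)

p = 0.1  # hypothetical crossover probability (an assumption for illustration)
W_bsc = [[1 - p, p], [p, 1 - p]]
C_bsc = mutual_information([0.5, 0.5], W_bsc)  # uniform input is optimal for a BSC
D_bsc = d_max(W_bsc)
print(C_bsc, D_bsc)
```

For this channel $D_{max}$ is several times larger than $C$, which is the gap that $J(R)$ interpolates across as $R$ goes from $0$ to $C$.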
Proof (of Lemma 1):

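Recall that $\mathcal{G}(i)$ is the decoding region for $M = i$, i.e. $\mathcal{G}(i) = \{y^\tau : \hat{M}(y^\tau) = i\}$. Then as a result of the data processing inequality for KL divergence we have

$$E\left[\ln \frac{\Pr[Y^\tau]}{\Pr[Y^\tau \mid M = i]}\right] \ge \Pr[\mathcal{G}(i)] \ln \frac{\Pr[\mathcal{G}(i)]}{\Pr[\mathcal{G}(i) \mid M = i]} + \Pr[\overline{\mathcal{G}(i)}] \ln \frac{\Pr[\overline{\mathcal{G}(i)}]}{\Pr[\overline{\mathcal{G}(i)} \mid M = i]}$$

$$\ge -h(\Pr[\mathcal{G}(i)]) + \Pr[\overline{\mathcal{G}(i)}] \ln \frac{1}{\Pr[\overline{\mathcal{G}(i)} \mid M = i]}$$

$$\ge -\ln 2 + \Pr[\overline{\mathcal{G}(i)}] \ln \frac{1}{\Pr[\overline{\mathcal{G}(i)} \mid M = i]} \qquad (37)$$

where in the last step we have used the fact that $h(\Pr[\mathcal{G}(i)]) \le \ln 2$. In addition

$$\Pr[\overline{\mathcal{G}(i)}] \ge \Pr[\hat{M} \neq i \mid M \neq i] \Pr[M \neq i] \ge \sum_{j \neq i} \Pr[\mathcal{G}(j) \mid M = j] \Pr[M = j] \ge 1 - P_e - \Pr[M = i]. \qquad (38)$$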
Thus using the equations (37) and (38) we get

$$\Pr[\overline{\mathcal{G}(i)} \mid M = i] \ge \exp\left(-\frac{E\left[\ln \frac{\Pr[Y^\tau]}{\Pr[Y^\tau \mid M = i]}\right] + \ln 2}{1 - P_e - \Pr[M = i]}\right) \qquad (39)$$

Now we lower bound the error probability of the special message by upper bounding $E\left[\ln \frac{\Pr[Y^\tau]}{\Pr[Y^\tau \mid M = i]}\right]$. For that let us consider the following stochastic sequence,

$$S_n = \ln \frac{\Pr[Y^n]}{\Pr[Y^n \mid M = i]} - \sum_{t=1}^{n} E\left[\ln \frac{\Pr[Y_t \mid Y^{t-1}]}{\Pr[Y_t \mid M = i, Y^{t-1}]} \,\Big|\, Y^{t-1}\right]$$

Note that $E[S_{n+1} \mid Y^n] = S_n$ and, since $\min_{x,y} W_{Y|X}(y|x) = \lambda$, we have $E[|S_{n+1} - S_n| \mid Y^n] \le 2 \ln \frac{1}{\lambda}$. Thus $S_n$ is a martingale; furthermore, since $E[\tau] < \infty$, we can use [37, Theorem 2 p 487] to get

$$E[S_\tau] = S_0 = 0.$$

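Thus

$$E\left[\ln \frac{\Pr[Y^\tau]}{\Pr[Y^\tau \mid M = i]}\right] = E\left[\sum_{t=1}^{\tau} E\left[\ln \frac{\Pr[Y_t \mid Y^{t-1}]}{\Pr[Y_t \mid M = i, Y^{t-1}]} \,\Big|\, Y^{t-1}\right]\right] \qquad (40)$$

Note that

$$E\left[\ln \frac{\Pr[Y_t \mid Y^{t-1}]}{\Pr[Y_t \mid M = i, Y^{t-1}]} \,\Big|\, Y^{t-1}\right] = E\left[\ln \frac{\Pr[Y_t \mid Y^{t-1}]}{W_{Y|X}(Y_t \mid x_t(i))} \,\Big|\, Y^{t-1}\right]$$

As a result of the definition of $J(\cdot)$ given in equation (36) we have,

$$E\left[\ln \frac{\Pr[Y_t \mid Y^{t-1}]}{\Pr[Y_t \mid M = i, Y^{t-1}]} \,\Big|\, Y^{t-1}\right] \le J\big(I(X_t; Y_t \mid Y^{t-1})\big) \qquad (41)$$

where $I(X_t; Y_t \mid Y^{t-1})$ is the mutual information between $X_t$ and $Y_t$ computed under the conditional distribution given the realization of $Y^{t-1}$.$^{13}$

Given $Y^{t-1}$, the random variables $M - X_t - Y_t$ form a Markov chain. Thus

$$I(X_t; Y_t \mid Y^{t-1}) \ge I(M; Y_t \mid Y^{t-1}). \qquad (42)$$

Since $J(\cdot)$ is a decreasing function, equations (40), (41) and (42) lead to

$$E\left[\ln \frac{\Pr[Y^\tau]}{\Pr[Y^\tau \mid M = i]}\right] \le E\left[\sum_{t=1}^{\tau} J\big(I(M; Y_t \mid Y^{t-1})\big)\right] \qquad (43)$$

Note that

$$E\left[\sum_{t=1}^{\tau} J\big(I(M; Y_t \mid Y^{t-1})\big)\right] = E\left[\tau \cdot \frac{1}{\tau} \sum_{t=1}^{\tau} J\big(I(M; Y_t \mid Y^{t-1})\big)\right] \overset{(a)}{\le} E\left[\tau J\!\left(\frac{1}{\tau} \sum_{t=1}^{\tau} I(M; Y_t \mid Y^{t-1})\right)\right] \overset{(b)}{\le} E[\tau]\, J\!\left(\frac{E\left[\sum_{t=1}^{\tau} I(M; Y_t \mid Y^{t-1})\right]}{E[\tau]}\right) \qquad (44)$$

13 Note that unlike the conventional definition of conditional mutual information, $I(X_t; Y_t \mid Y^{t-1})$ is not averaged over the conditioned random variable $Y^{t-1}$.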
where in both (a) and (b) we use the concavity of the $J(\cdot)$ function together with Jensen's inequality. Thus using equations (39), (43) and (44) we get,

$$\Pr[\hat{M} \neq i \mid M = i] \ge \exp\left(-\frac{E[\tau]\, J\!\left(\frac{E\left[\sum_{t=1}^{\tau} I(M; Y_t \mid Y^{t-1})\right]}{E[\tau]}\right) + \ln 2}{1 - P_e - \Pr[M = i]}\right)$$

Since $J(R)$ is decreasing in $R$, the only thing we are left to show is that

$$E\left[\sum_{t=1}^{\tau} I(M; Y_t \mid Y^{t-1})\right] \ge H(M) - h(P_e) - P_e \ln(|\mathcal{M}| - 1) \qquad (45)$$

For that consider the stochastic sequence,

$$V_n = H(M \mid Y^n) + \sum_{t=1}^{n} I(M; Y_t \mid Y^{t-1})$$

Clearly $E[V_{n+1} \mid Y^n] = V_n$ and $E[|V_n|] < \infty$, thus $\{V_n\}$ is a martingale. Furthermore $E[|V_{n+1} - V_n| \mid Y^n] \le K$ and $E[\tau] < \infty$, thus using a version of Doob's optional stopping theorem, [37, Theorem 2 p 487], we get, with $V_0 = H(M)$,

$$V_0 = E[V_\tau] = E[H(M \mid Y^\tau)] + E\left[\sum_{t=1}^{\tau} I(M; Y_t \mid Y^{t-1})\right] \qquad (46)$$

One can write Fano's inequality as follows,

$$H(M \mid Y^\tau) \le h\big(\Pr[\hat{M}(Y^\tau) \neq M \mid Y^\tau]\big) + \Pr[\hat{M}(Y^\tau) \neq M \mid Y^\tau] \ln(|\mathcal{M}| - 1).$$

Consequently

$$E[H(M \mid Y^\tau)] \le E\Big[h\big(\Pr[\hat{M}(Y^\tau) \neq M \mid Y^\tau]\big)\Big] + E\Big[\Pr[\hat{M}(Y^\tau) \neq M \mid Y^\tau]\Big] \ln(|\mathcal{M}| - 1). \qquad (47)$$

Using the concavity of binary entropy,

$$E[H(M \mid Y^\tau)] \le h(P_e) + P_e \ln(|\mathcal{M}| - 1).$$

Using equation (46) together with equation (47) we get the desired condition given in equation (45). ∎
The above proof is for encoding schemes which do not have any randomization (time sharing), but the same ideas can be used to establish the exact same result for general variable length block codes with randomization. Now we are ready to prove the converse part of Theorem 8.
Proof (of Converse part of Theorem 8):

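In order to prove $E^f_{md} \le C$, first note that for the capacity achieving sequences we consider, $\Pr[M = i] = 1/|\mathcal{M}^{(k)}|$. Thus Lemma 1 gives, with $P_e^{(k)}(i) = \Pr^{(k)}[\hat{M} \neq i \mid M = i]$,

$$\frac{-\ln P_e^{(k)}(i)}{E[\tau^{(k)}]} \le \frac{1}{1 - P_e^{(k)} - \frac{1}{|\mathcal{M}^{(k)}|}} \left[ J\!\left(\frac{\ln|\mathcal{M}^{(k)}| - h(P_e^{(k)}) - P_e^{(k)} \ln(|\mathcal{M}^{(k)}| - 1)}{E[\tau^{(k)}]}\right) + \frac{\ln 2}{E[\tau^{(k)}]} \right] \qquad (48)$$

Thus for any capacity achieving sequence with feedback,

$$\lim_{k \to \infty} \frac{-\ln P_e^{(k)}(i)}{E[\tau^{(k)}]} \le J(C) = C. \qquad ∎$$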
E. Proof of Theorem 9

In this subsection we show how the strategy for sending a special bit can be combined with the Yamamoto-Itoh strategy when many special messages demand a missed-detection exponent. However, unlike the previous results about capacity achieving sequences, Theorems 5, 6, 7 and 8, we will have an additional uniform delay assumption.

We will restrict ourselves to uniform delay capacity achieving sequences.$^{14}$ Clearly capacity achieving sequences in general need not be uniform delay. Indeed many messages, $i \in \mathcal{M}$, can get an expected delay, $E[\tau \mid M = i]$, much larger than the average delay, $E[\tau]$. This in return can decrease the error probability of these messages. The potential drawback of such codes is that their average delay is sensitive to the assumption of messages being chosen according to a uniform probability distribution. The expected decoding time, $E[\tau]$, can increase a lot if the code is used in a system in which the messages are not chosen uniformly.

It is worth emphasizing that all previously discussed exponents (single message exponent $E^f_{md}$, single bit exponent $E^f_b$, many bits exponent $E^f_{bits}(r)$ and achievable multi-layer exponent regions) remain unchanged whether or not this uniform delay constraint is imposed. Thus the flexibility to provide different expected delays to different messages does not improve those exponents.

However, this is not true for message-wise UEP with exponentially many messages. Removing the uniform delay constraint can considerably enhance the protection of special messages at rates higher than $\left(1 - \frac{C}{D_{max}}\right) C$. Indeed one can make the exponent of all special messages $C$. The flexibility of providing more resources (decoding delay) to special messages achieves this enhancement. However, we will not discuss those cases in this article and stick to uniform delay codes.
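The threshold rate mentioned above is where the two terms of the uniform delay exponent $\min\{C, (1 - r/C) D_{max}\}$ cross. A minimal sketch, assuming that expression for the exponent (as stated in the achievability and converse headings of this subsection); the function names and the numeric values are illustrative assumptions.

```python
def missed_detection_exponent(r, C, D_max):
    """Uniform delay exponent min{C, (1 - r/C) * D_max} for special messages of rate r."""
    if not 0 <= r <= C:
        raise ValueError("rate r must lie in [0, C]")
    return min(C, (1 - r / C) * D_max)

def threshold_rate(C, D_max):
    """Rate (1 - C/D_max) * C above which the exponent falls below C."""
    return (1 - C / D_max) * C

# Hypothetical channel parameters, in nats per channel use.
C, D_max = 0.5, 2.0
print(threshold_rate(C, D_max))                   # crossover rate
print(missed_detection_exponent(0.25, C, D_max))  # below the threshold: limited by C
print(missed_detection_exponent(0.45, C, D_max))  # above the threshold: limited by D_max
```

Below the threshold the exponent is capped by $C$; above it the control phase length, and hence $(1 - r/C) D_{max}$, becomes the bottleneck.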

1) Achievability: $E^f_{md}(r) \ge \min\{C, (1 - \frac{r}{C}) D_{max}\}$:
The optimal scheme here reverses the trick for achieving $E^f_b$: first a special bit tells the receiver whether the message being transmitted is a special one or not. After the decoding of this bit the message itself is transmitted. This further emphasizes how feedback connects bit-wise and message-wise UEP, when used with variable decoding time.
Proof:

Like all the previous achievability results, we construct a capacity achieving sequence, $Q$, with the desired asymptotic behavior. A sequence of multi phase fixed length errors and erasures codes, $Q'$, is used as the building block of $Q$. Let us consider the $k$th member of $Q'$. In the first phase the transmitter sends one of the two input symbols with distinct output distributions for $\lceil \sqrt{k} \rceil$ time units in order to tell whether $M \in \mathcal{M}_s$ or not. Let $b$ be $b = \mathbb{1}\{M \in \mathcal{M}_s^{(k)}\}$. Then, as mentioned in subsection VIII-A.1, with a threshold decoding we can achieve

$$\Pr[\hat{b} \neq 1 \mid b = 1] \le e^{-\mu \sqrt{k}} \quad \text{and} \quad \Pr[\hat{b} \neq 0 \mid b = 0] \le e^{-\mu \sqrt{k}}, \quad \text{where } \mu > 0. \qquad (49)$$

The actual value of $\mu$ is not important for us; we are merely interested in an upper bound vanishing with increasing $k$.

In the second phase one of two length $k$ codes is used depending on $\hat{b}$.

• If $\hat{b} = 0$, in the second phase, the transmitter uses the $k$th member of a capacity achieving sequence, $Q''$, such that $E_{md,Q''} = C$. We know that such a sequence exists because of Theorem 2. The message, $M'$, of $Q''$ is determined using the following mapping

$$M \in \mathcal{M}_s \Rightarrow M' = 1$$

$$M \notin \mathcal{M}_s \Rightarrow M' = M - |\mathcal{M}_s| + 1$$

At the end of the second phase, the receiver decodes $\hat{M}'$. If $\hat{M}' = 1$, then the receiver declares an erasure, $\hat{M} = \text{erasure}$. If $\hat{M}' \neq 1$, then $\hat{M} = \hat{M}' + |\mathcal{M}_s| - 1$.

14 Recall that for any reliable variable length block code with feedback $\Gamma$ is defined as $\Gamma = \frac{\max_{i \in \mathcal{M}} E[\tau \mid M = i]}{E[\tau]}$, and uniform delay reliable sequences are the ones that satisfy $\lim_{k \to \infty} \Gamma_{Q^{(k)}} = 1$.
• If $\hat{b} = 1$, the transmitter uses a two phase code with errors and erasures in the second phase, like the one described by Yamamoto and Itoh in [40]. The two phases of this code are called the communication and control phases, respectively.

In the communication phase the transmitter uses the $\lceil \frac{r}{C} k \rceil$th member of a capacity achieving sequence, $Q''$, with $E_{md,Q''} = C$, to convey its message, $M'$. The auxiliary message $M'$ is determined as follows,

$$M \notin \mathcal{M}_s \Rightarrow M' = 1$$

$$M \in \mathcal{M}_s \Rightarrow M' = M + 1$$

The decoded message of the $\lceil \frac{r}{C} k \rceil$th member of $Q''$ is called the tentative decision of the communication phase and is denoted by $\tilde{M}'$. In the control phase,

- if $\tilde{M}' = M'$ the tentative decision is confirmed by sending the accept symbol $x_a$ for $\ell(k) = k - \lceil \frac{r}{C} k \rceil$ time units.

- if $\tilde{M}' \neq M'$ the tentative decision is rejected by sending the reject symbol $x_d$ for $\ell(k) = k - \lceil \frac{r}{C} k \rceil$ time units,

where $x_a$ and $x_d$ are the maximizers in the following optimization problem.

$$D_{max} = \max_{i,j} D\big(W_{Y|X}(\cdot|i) \,\|\, W_{Y|X}(\cdot|j)\big) = D\big(W_{Y|X}(\cdot|x_a) \,\|\, W_{Y|X}(\cdot|x_d)\big)$$

If the output sequence in the last $\ell(k)$ time steps is typical with $W_{Y|X}(\cdot|x_a)$ then $\hat{M}' = \tilde{M}'$, else an erasure is declared for $\hat{M}'$. Note that the total probability of $W_{Y|X}(\cdot|x_a)$-typical sequences is less than $e^{-\ell(k)(D_{max} - \delta_{\ell(k)})}$ when $\tilde{M}' \neq M'$ and more than $1 - \delta_{\ell(k)}$ when $\tilde{M}' = M'$, where $\lim_{\ell(k) \to \infty} \delta_{\ell(k)} = 0$ [13, Corollary 1.2, p 19].

If $\hat{M}' = \text{erasure}$ or if $\hat{M}' = 1$ then the receiver declares an erasure for $M$, $\hat{M} = \text{erasure}$. If $\hat{M}' \in \{2, 3, \ldots, |\mathcal{M}_s| + 1\}$, then $\hat{M} = \hat{M}' - 1$.
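The auxiliary message mappings for the two cases of the decoded bit can be checked with a small round trip. The sketch below assumes, as the mappings above suggest, that the special messages are labeled $1, \ldots, |\mathcal{M}_s|$; the names `encode_aux`, `decode_aux` and `ERASURE` are hypothetical, and the channel and the control phase are not modeled.

```python
ERASURE = None  # hypothetical marker for an erasure declaration

def encode_aux(m, n_special, b_hat):
    """Auxiliary message M' for message m when the receiver decoded the bit as b_hat.

    Assumes the special set is {1, ..., n_special}; M' = 1 is the reserved
    special message that signals a wrongly decoded bit.
    """
    is_special = m <= n_special
    if b_hat == 0:
        return 1 if is_special else m - n_special + 1
    return m + 1 if is_special else 1

def decode_aux(m_aux_hat, n_special, b_hat):
    """Map a decoded auxiliary message back to a message, or declare an erasure."""
    if m_aux_hat is ERASURE or m_aux_hat == 1:
        return ERASURE
    if b_hat == 0:
        return m_aux_hat + n_special - 1
    return m_aux_hat - 1

# With a correctly decoded bit the message is recovered; with a wrong bit
# the reserved auxiliary message 1 leads to an erasure (noiseless second phase).
print(decode_aux(encode_aux(5, 3, 0), 3, 0))
print(decode_aux(encode_aux(5, 3, 1), 3, 1))
```

The point of reserving $M' = 1$ is that an error in the first phase is converted into an erasure rather than an undetected error, unless the special message of $Q''$ is itself missed.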

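Now we can calculate the error and erasure probabilities of the two phase fixed length block code. Let us denote the erasures by $\hat{M} = \text{erasure}$ for each $k$. Below, $P_{e,Q''}^{(n)}$ denotes the error probability of the $n$th member of $Q''$ and $P_{e,Q''}^{(k)}(1)$ the conditional error probability of its special message.

For $i \in \mathcal{M}_s$, using equation (49) and Bayes' rule we get

$$\Pr[\hat{M} = \text{erasure} \mid M = i] \le e^{-\mu\sqrt{k}} + P_{e,Q''}^{(\lceil rk/C \rceil)} + \delta_{\ell(k)} \qquad (50)$$

$$\Pr[\hat{M} \neq i, \hat{M} \neq \text{erasure} \mid M = i] \le e^{-\mu\sqrt{k}}\, P_{e,Q''}^{(k)}(1) + P_{e,Q''}^{(\lceil rk/C \rceil)}\, e^{-\ell(k)(D_{max} - \delta_{\ell(k)})} \qquad (51)$$

For $i \notin \mathcal{M}_s$, using equation (49) and Bayes' rule we get

$$\Pr[\hat{M} = \text{erasure} \mid M = i] \le e^{-\mu\sqrt{k}} + P_{e,Q''}^{(k)} \qquad (52)$$

$$\Pr[\hat{M} \neq i, \hat{M} \neq \text{erasure} \mid M = i] \le e^{-\mu\sqrt{k}} + P_{e,Q''}^{(k)} \qquad (53)$$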


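Whenever $\hat{M} = \text{erasure}$, the transmitter and receiver try to send the message once again from scratch using the same strategy. Then for any $i \in \mathcal{M}$ the resulting variable length code satisfies

$$\Pr[\hat{M} \neq i \mid M = i] = \frac{\Pr[\hat{M} \neq i, \hat{M} \neq \text{erasure} \mid M = i]}{1 - \Pr[\hat{M} = \text{erasure} \mid M = i]} \qquad (54)$$

$$E[\tau \mid M = i] = \frac{k + \lceil \sqrt{k} \rceil}{1 - \Pr[\hat{M} = \text{erasure} \mid M = i]} \qquad (55)$$

where the probabilities on the right hand sides are those of the fixed length code. Using equations (50), (51), (52), (53), (54) and (55) we conclude that $Q$ is a capacity achieving sequence such that

$$\lim_{k \to \infty} \frac{-\ln \max_{i \in \mathcal{M}_s} \Pr[\hat{M} \neq i, \hat{M} \neq \text{erasure} \mid M = i]}{E[\tau]} = \min\left\{C, \left(1 - \frac{r}{C}\right) D_{max}\right\}$$

$$\lim_{k \to \infty} \frac{\ln |\mathcal{M}_s^{(k)}|}{E[\tau]} = r.$$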
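The effect of repeating the fixed length code until no erasure occurs, equations (54) and (55), is a geometric series in the erasure probability. A minimal numeric sketch follows; the function name and the probabilities plugged in are hypothetical and only illustrate the scaling.

```python
import math

def retransmission_stats(k, p_erasure, p_error):
    """Expected decoding time and error probability after retransmissions.

    k         : length of the second phase; the first phase lasts ceil(sqrt(k))
    p_erasure : per-block erasure probability of the fixed length code
    p_error   : per-block probability of an undetected error
    """
    block_length = k + math.ceil(math.sqrt(k))
    expected_delay = block_length / (1 - p_erasure)  # equation (55)
    error_prob = p_error / (1 - p_erasure)           # equation (54)
    return expected_delay, error_prob

delay, err = retransmission_stats(k=100, p_erasure=0.2, p_error=1e-6)
print(delay, err)
```

Since the erasure probability vanishes with increasing $k$, both quantities converge to their fixed length counterparts, which is why the exponents are unaffected by the retransmissions.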
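2) Converse: $E^f_{md}(r) \le \min\{C, (1 - \frac{r}{C}) D_{max}\}$:
Proof:

Consider any uniform delay capacity achieving sequence, $Q$. Note that by excluding all $i \notin \mathcal{M}_s$ we get a reliable sequence, $Q'$, such that

$$P_{e,Q'}^{(k)} \le \Pr{}^{(k)}[\hat{M} \neq M \mid M \in \mathcal{M}_s] \quad \text{and} \quad E[\tau_{Q'}^{(k)}] \le \Gamma^{(k)} E[\tau^{(k)}].$$

Thus

$$\frac{-\ln \Pr^{(k)}[\hat{M} \neq M \mid M \in \mathcal{M}_s]}{E[\tau^{(k)}]} \le \Gamma^{(k)}\, \frac{-\ln P_{e,Q'}^{(k)}}{E[\tau_{Q'}^{(k)}]}.$$

Consequently, by Burnashev's bound on the error exponent of variable length block codes with feedback at rate $r$, $E^f_{md}(r) \le (1 - \frac{r}{C}) D_{max}$. Similarly, by excluding all but one of the elements of $\mathcal{M}_s$ we can prove that $E^f_{md}(r) \le C$, using Theorem 8 and the uniform delay condition. ∎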



IX. Avoiding False Alarms: Proofs 

A. Block Codes without Feedback: Proof of Theorem 10

1) Lower Bound: $E_{fa} \ge E^l_{fa}$:
Proof:

As a result of the coding theorem [13, Ch. 2 Corollary 1.3, page 102] we know that there exists a reliable sequence $Q'$ of fixed composition codes whose rate is $C$ and whose $n$th element's composition $P_X^{(n)}$ satisfies,



^PT (i) - P x (i)\ < 

i&X 

We use the codewords of the $n$th element of $Q'$ as the codewords of the ordinary messages in the $n$th code in $Q$. For the special message we use a length-$n$ repetition sequence $x^n(1) = (x_{fa}, x_{fa}, \ldots, x_{fa})$.

The decoding region for the special message is essentially the bare minimum. We include the typical channel outputs within the decoding region of the special message to ensure a small missed detection probability for the special message, but we exclude all other output sequences $y^n$:

$$\mathcal{G}(1) = \Big\{y^n : \sum_j \big|Q_{(y^n)}(j) - W_{Y|X}(j|x_{fa})\big| \le \sqrt[4]{1/n}\Big\}$$

where $Q_{(y^n)}$ denotes the empirical distribution of $y^n$.

Note that this definition of $\mathcal{G}(1)$ itself ensures that the special message is transmitted reliably whenever it is sent,

$$\lim_{n \to \infty} \Pr{}^{(n)}[\hat{M} \neq 1 \mid M = 1] = 0.$$

The decoding region of each ordinary message, $j \in \{2, 3, \ldots, |\mathcal{M}^{(n)}|\}$, is the intersection of the corresponding decoding region in $Q'$ with the complement of $\mathcal{G}(1)$. Thus the fact that $Q'$ is a reliable sequence implies that,

$$\lim_{n \to \infty} \Pr{}^{(n)}[\hat{M} \notin \{1, j\} \mid M = j] = 0 \quad \forall j \neq 1.$$

Consequently we have reliable communication for ordinary messages as long as $\lim_{n \to \infty} \Pr^{(n)}[\mathcal{G}(1) \mid M = j] = 0$, $\forall j \neq 1$. But we prove a much stronger result to ensure that $\Pr^{(n)}[\hat{M} = 1 \mid M \neq 1]$ is decaying fast enough.
Before doing that let us note that in the second stage of the decoding, when we are choosing a message among the ordinary ones, an ML decoder can be used instead of the decoding rule of the original code. Doing that will only decrease the average error probability.
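Note that the probability of a $V$-shell of a message $i$ is equal, to first order in the exponent, to

$$\Pr{}^{(n)}[T_V(i) \mid M = i] = e^{-n D(V_{Y|X}(\cdot|X) \,\|\, W_{Y|X}(\cdot|X) \,|\, P_X^{(n)})}$$

Note also that $\mathcal{G}(1)$ can be written as the union of $V$-shells of a message $i$ as follows.

$$\mathcal{G}(1) = \bigcup_{V_{Y|X} \in \mathcal{V}^{(n)}} T_V(i) \quad \forall i \neq 1$$

where $\mathcal{V}^{(n)} = \big\{V_{Y|X} : \sum_j \big|\sum_k V_{Y|X}(j|k) P_X^{(n)}(k) - W_{Y|X}(j|x_{fa})\big| \le \sqrt[4]{1/n}\big\}$. Note that since there are at most $(1 + n)^{|\mathcal{X}||\mathcal{Y}|}$ different conditional types,

$$\Pr{}^{(n)}[\mathcal{G}(1) \mid M = i] \le (1 + n)^{|\mathcal{X}||\mathcal{Y}|} \max_{V_{Y|X} \in \mathcal{V}^{(n)}} \Pr{}^{(n)}[T_V(i) \mid M = i]$$

Thus for all $i > 1$

$$\lim_{n \to \infty} -\frac{1}{n} \ln \Pr{}^{(n)}[\mathcal{G}(1) \mid M = i] = \min_{V_{Y|X}:\, \sum_k V_{Y|X}(\cdot|k) P_X^*(k) = W_{Y|X}(\cdot|x_{fa})} D\big(V_{Y|X}(\cdot|X) \,\|\, W_{Y|X}(\cdot|X) \,\big|\, P_X^*\big) \qquad ∎$$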
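The two ingredients of this bound, the conditional divergence in the exponent of a $V$-shell and the polynomial count of conditional types, can be evaluated directly. The sketch below is illustrative only: the function names, the binary symmetric channel and the candidate $V$ are assumptions, and no minimization over $\mathcal{V}^{(n)}$ is attempted.

```python
import math

def conditional_divergence(V, W, P):
    """D(V||W|P) = sum_k P(k) * D(V(.|k) || W(.|k)) in nats; W must be strictly positive."""
    total = 0.0
    for k, pk in enumerate(P):
        for v, w in zip(V[k], W[k]):
            if pk > 0 and v > 0:
                total += pk * v * math.log(v / w)
    return total

def type_count_penalty(n, nx, ny):
    """(1/n) * ln((1 + n)^(nx * ny)): exponent lost to the union over conditional types."""
    return nx * ny * math.log(1 + n) / n

W = [[0.9, 0.1], [0.1, 0.9]]  # hypothetical BSC with crossover 0.1
V = [[0.5, 0.5], [0.5, 0.5]]  # one candidate conditional type
P = [0.5, 0.5]
print(conditional_divergence(V, W, P))
print(type_count_penalty(100, 2, 2), type_count_penalty(10**6, 2, 2))
```

The penalty term decays like $\ln(n)/n$, which is why the number of conditional types does not affect the exponent in the limit.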



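2) Upper Bound: $E_{fa} \le E^u_{fa}$:
Proof:

As a result of the data processing inequality for KL divergence we have

$$\sum_{y^n} \Pr[y^n \mid M = 1] \ln \frac{\Pr[y^n \mid M = 1]}{\Pr[y^n \mid M \neq 1]} \ge \Pr[\mathcal{G}(1) \mid M = 1] \ln \frac{\Pr[\mathcal{G}(1) \mid M = 1]}{\Pr[\mathcal{G}(1) \mid M \neq 1]} + \Pr[\overline{\mathcal{G}(1)} \mid M = 1] \ln \frac{\Pr[\overline{\mathcal{G}(1)} \mid M = 1]}{\Pr[\overline{\mathcal{G}(1)} \mid M \neq 1]}$$

$$\ge -\ln 2 - \Pr[\mathcal{G}(1) \mid M = 1] \ln \Pr[\mathcal{G}(1) \mid M \neq 1] \qquad (56)$$

Using the convexity of the KL divergence we get

$$\sum_{y^n} \Pr[y^n \mid M = 1] \ln \frac{\Pr[y^n \mid M = 1]}{\Pr[y^n \mid M \neq 1]} \le \sum_{i=2}^{|\mathcal{M}|} \frac{1}{|\mathcal{M}| - 1} \sum_{y^n} \Pr[y^n \mid M = 1] \ln \frac{\Pr[y^n \mid M = 1]}{\Pr[y^n \mid M = i]}$$

$$= \sum_{i=2}^{|\mathcal{M}|} \frac{1}{|\mathcal{M}| - 1} \sum_{y^n} \Pr[y^n \mid M = 1] \sum_{k=1}^{n} \ln \frac{W_{Y|X}(y_k \mid x_k(1))}{W_{Y|X}(y_k \mid x_k(i))}$$

$$= \sum_{k=1}^{n} \sum_{i=2}^{|\mathcal{M}|} \frac{1}{|\mathcal{M}| - 1} D\big(W_{Y|X}(\cdot|x_k(1)) \,\|\, W_{Y|X}(\cdot|x_k(i))\big) \qquad (57)$$

where $x_k(i)$ denotes the input letter for the codeword of message $i$ at time $k$.
Let us denote the empirical distribution of the $x_k(i)$ for time $k$ by $P_{X_k}$.

Using equations (56) and (57) we get

$$\Pr[\mathcal{G}(1) \mid M \neq 1] \ge \exp\left(-\frac{\sum_{k=1}^{n} D\big(W_{Y|X}(\cdot|x_k(1)) \,\|\, W_{Y|X}(\cdot|X_k) \,\big|\, P_{X_k}\big) + \ln 2}{\Pr[\mathcal{G}(1) \mid M = 1]}\right) \qquad (58)$$

We show below that for all capacity achieving codes, almost all of the $k$'s have a $P_{X_k}$ which is essentially equal to $P_X^*$. For doing that let us first define the set $\mathcal{P}(\epsilon)$ and $\delta(\epsilon)$:

$$\mathcal{P}(\epsilon) \triangleq \{P_X : I(P_X, W_{Y|X}) \ge C - \epsilon\} \quad \text{and} \quad \delta(\epsilon) = \max_{P_X \in \mathcal{P}(\epsilon)} \sum_i |P_X(i) - P_X^*(i)|$$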
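Note that $\lim_{\epsilon \to 0} \delta(\epsilon) = 0$. As a result of Fano's inequality we have,

$$I(M; Y^n) \ge n R^{(n)} (1 - P_e) - \ln 2 \qquad (59)$$

On the other hand, using standard manipulations on mutual information we get

$$I(M; Y^n) \le \sum_{k=1}^{n} I(P_{X_k}, W_{Y|X}) \le C n - \epsilon \sum_{k=1}^{n} \mathbb{1}\{P_{X_k} \notin \mathcal{P}(\epsilon)\} \qquad (60)$$

Using equation (60) in equation (59) we get,

$$\sum_{k=1}^{n} \mathbb{1}\{P_{X_k} \notin \mathcal{P}(\epsilon)\} \le \frac{n \big(C - R^{(n)}(1 - P_e) + \ln 2 / n\big)}{\epsilon}$$

Let $\epsilon^{(n)}$ be $\epsilon^{(n)} = \sqrt{C - R^{(n)}(1 - P_e) + \frac{\ln 2}{n}}$; then $\lim_{n \to \infty} \epsilon^{(n)} = 0$ and

$$\sum_{k=1}^{n} \mathbb{1}\{P_{X_k} \notin \mathcal{P}(\epsilon^{(n)})\} \le n \epsilon^{(n)} \qquad (61)$$

Note that for any $P_{X_k} \in \mathcal{P}(\epsilon^{(n)})$ we have

$$D\big(W_{Y|X}(\cdot|x_k(1)) \,\|\, W_{Y|X}(\cdot|X_k) \,\big|\, P_{X_k}\big) \le D\big(W_{Y|X}(\cdot|x_k(1)) \,\|\, W_{Y|X}(\cdot|X) \,\big|\, P_X^*\big) + \delta(\epsilon^{(n)}) D_{max} \le E^u_{fa} + \delta(\epsilon^{(n)}) D_{max} \qquad (62)$$

where $E^u_{fa} = \max_{i \in \mathcal{X}} D\big(W_{Y|X}(\cdot|i) \,\|\, W_{Y|X}(\cdot|X) \,\big|\, P_X^*\big)$.
Using equations (61) and (62), and bounding each of the remaining terms by $D_{max}$,

$$\sum_{k=1}^{n} D\big(W_{Y|X}(\cdot|x_k(1)) \,\|\, W_{Y|X}(\cdot|X_k) \,\big|\, P_{X_k}\big) \le n \big(E^u_{fa} + \delta(\epsilon^{(n)}) D_{max} + \epsilon^{(n)} D_{max}\big)$$

Inserting this in equation (58) and recalling that $\Pr[\mathcal{G}(1) \mid M = 1] \to 1$ we get

$$\lim_{n \to \infty} -\frac{1}{n} \ln \Pr{}^{(n)}[\mathcal{G}(1) \mid M \neq 1] \le E^u_{fa}. \qquad ∎$$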
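The upper bound $E^u_{fa} = \max_i D(W_{Y|X}(\cdot|i) \| W_{Y|X}(\cdot|X) | P_X^*)$ is a finite maximization and is easy to evaluate once the capacity achieving input distribution is known. A minimal sketch, assuming a binary symmetric channel with a hypothetical crossover probability of 0.1, for which the uniform distribution is capacity achieving; the function names are illustrative.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats; assumes q has no zero entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def upper_false_alarm_exponent(W, P_star):
    """max_i sum_k P*(k) * D(W(.|i) || W(.|k)) for channel matrix W and input law P*."""
    n = len(W)
    return max(sum(P_star[k] * kl(W[i], W[k]) for k in range(n)) for i in range(n))

W_bsc = [[0.9, 0.1], [0.1, 0.9]]  # hypothetical BSC, crossover 0.1
print(upper_false_alarm_exponent(W_bsc, [0.5, 0.5]))
```

For this symmetric example the value is half of $D_{max}$, because the repeated letter of the special message coincides with the ordinary codeword letter about half of the time.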



B. Variable Length Block Codes with Feedback: Proof of Theorem 11

1) Achievability: $E^f_{fa} \ge D_{max}$:
Proof:

We construct a capacity achieving sequence with feedback, $Q$, by using a construction like the one we have for $E^f_{md}(r)$. In fact, this scheme achieves the false alarm exponent simultaneously with the best missed detection exponent, $C$, for the special message.

We use a fixed length multi-phase errors and erasures code as the building block for the $k$th member of $Q$. In the first phase, $b = \mathbb{1}\{M = 1\}$ is conveyed using a length $\lceil \sqrt{k} \rceil$ repetition code, like we did in subsections VIII-A.1 and VIII-E.1. Recall that

$$\Pr[\hat{b} \neq 1 \mid b = 1] \le e^{-\mu\sqrt{k}} \quad \text{and} \quad \Pr[\hat{b} \neq 0 \mid b = 0] \le e^{-\mu\sqrt{k}}, \quad \mu > 0. \qquad (63)$$

In the second phase one of the two length $k$ codes is used depending on $\hat{b}$.
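• If $\hat{b} = 0$, the transmitter uses the $k$th member of a capacity achieving sequence, $Q'$, such that $E_{md,Q'} = C$ to convey the message. We know that such a sequence exists because of Theorem 2. Let the message of $Q$ be the message of $Q'$, i.e. the auxiliary message,

$$M' = M.$$

If at the end of the second phase $\hat{M}' = 1$, the receiver declares an erasure, $\hat{M} = \text{erasure}$, else $M$ is decoded as $\hat{M} = \hat{M}'$.

• If $\hat{b} = 1$, the transmitter uses a length $k$ repetition code to convey whether $M = 1$ or not.

- If $M = 1$, $M' = 1$ and the transmitter sends the codeword $(x_a, x_a, \ldots, x_a)$.

- If $M \neq 1$, $M' = 0$ and the transmitter sends the codeword $(x_d, x_d, \ldots, x_d)$,

where $x_a$ and $x_d$ are the maximizers achieving $D_{max}$:

$$D_{max} = \max_{i,j} D\big(W_{Y|X}(\cdot|i) \,\|\, W_{Y|X}(\cdot|j)\big) = D\big(W_{Y|X}(\cdot|x_a) \,\|\, W_{Y|X}(\cdot|x_d)\big)$$

The receiver decodes $\hat{M}' = 1$ only when the output sequence is typical with $W_{Y|X}(\cdot|x_a)$. Evidently, as before, we have [13, Corollary 1.2, p 19]

$$\Pr[\hat{M}' = 0 \mid M' = 1] \le \delta_k \qquad (64)$$

$$\Pr[\hat{M}' = 1 \mid M' = 0] \le e^{-k(D_{max} - \delta_k)} \qquad (65)$$

where $\lim_{k \to \infty} \delta_k = 0$.

If $\hat{M}' = 1$ then $\hat{M} = 1$, else the receiver declares an erasure for the whole block, i.e. $\hat{M} = \text{erasure}$.
Now we can calculate the error and erasure probabilities for the $(\lceil \sqrt{k} \rceil + k)$ long block code. Using the equations (63), (64), (65) and Bayes' rule we get, for $i \neq 1$, with $P_{e,Q'}^{(k)}$ the error probability of the $k$th member of $Q'$ and $P_{e,Q'}^{(k)}(1)$ that of its special message,

$$\Pr[\hat{M} = \text{erasure} \mid M = 1] \le e^{-\mu\sqrt{k}} + \delta_k \qquad (66)$$

$$\Pr[\hat{M} = \text{erasure} \mid M = i] \le e^{-\mu\sqrt{k}} + P_{e,Q'}^{(k)} \qquad (67)$$

$$\Pr[\hat{M} \in \mathcal{M} \setminus \{1\} \mid M = 1] \le e^{-\mu\sqrt{k}}\, P_{e,Q'}^{(k)}(1) \qquad (68)$$

$$\Pr[\hat{M} = 1 \mid M = i] \le e^{-\mu\sqrt{k}}\, e^{-k(D_{max} - \delta_k)} \qquad (69)$$

$$\Pr[\hat{M} \notin \{1, i, \text{erasure}\} \mid M = i] \le P_{e,Q'}^{(k)} \qquad (70)$$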


Whenever $\hat{M} = \text{erasure}$, the transmitter tries to send the message again from scratch, using the same strategy. Consequently all of the above error probabilities are scaled by a factor of $\frac{1}{1 - \Pr[\hat{M} = \text{erasure} \mid M = i]}$ when we consider the corresponding error probabilities for the variable decoding time code. Furthermore

$$E[\tau \mid M = i] = \frac{k + \lceil \sqrt{k} \rceil}{1 - \Pr[\hat{M} = \text{erasure} \mid M = i]} \qquad (71)$$

Using equations (66), (67), (68), (69), (70) and (71) we conclude that $Q$ is a capacity achieving code with $E^f_{md,Q} = C$ and $E^f_{fa,Q} = D_{max}$. ∎
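2) Converse: $E^f_{fa} \le D_{max}$:
Proof:

Note that as a result of the convexity of KL divergence we have

$$E\left[\ln \frac{\Pr[Y^\tau \mid M = 1]}{\Pr[Y^\tau \mid M \neq 1]} \,\Big|\, M = 1\right] \ge \Pr[\mathcal{G}(1) \mid M = 1] \ln \frac{\Pr[\mathcal{G}(1) \mid M = 1]}{\Pr[\mathcal{G}(1) \mid M \neq 1]} + \Pr[\overline{\mathcal{G}(1)} \mid M = 1] \ln \frac{\Pr[\overline{\mathcal{G}(1)} \mid M = 1]}{\Pr[\overline{\mathcal{G}(1)} \mid M \neq 1]}$$

$$\ge -\ln 2 + \Pr[\mathcal{G}(1) \mid M = 1] \ln \frac{1}{\Pr[\mathcal{G}(1) \mid M \neq 1]} \qquad (72)$$

It has already been proved in [4] that,

$$E\left[\ln \frac{\Pr[Y^\tau \mid M = 1]}{\Pr[Y^\tau \mid M \neq 1]} \,\Big|\, M = 1\right] \le D_{max}\, E[\tau \mid M = 1] \qquad (73)$$

Note that as a result of the definition of $\Gamma$ we have $E[\tau \mid M = 1] \le E[\tau]\, \Gamma$. Using this together with equations (72) and (73) we get,

$$\Pr[\mathcal{G}(1) \mid M \neq 1] \ge \exp\left(-\frac{\ln 2 + \Gamma D_{max} E[\tau]}{\Pr[\mathcal{G}(1) \mid M = 1]}\right)$$

Thus for any uniform delay reliable sequence, $Q$, we have $E^f_{fa,Q} \le D_{max}$. ∎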
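The last inequality can be turned into a bound on the false alarm exponent for finite expected decoding time, which makes the roles of $\Gamma$ and of the detection probability visible. A minimal sketch with hypothetical parameter values; the function names are assumptions introduced for illustration.

```python
import math

def false_alarm_lower_bound(expected_delay, gamma, d_max, p_detect):
    """Lower bound exp(-(ln 2 + gamma * d_max * E[tau]) / Pr[G(1)|M=1]) on the false alarm probability."""
    return math.exp(-(math.log(2) + gamma * d_max * expected_delay) / p_detect)

def exponent_upper_bound(expected_delay, gamma, d_max, p_detect):
    """Implied upper bound on -ln(false alarm probability) / E[tau]."""
    return (math.log(2) + gamma * d_max * expected_delay) / (p_detect * expected_delay)

# Hypothetical values: D_max = 1.5 nats, nearly uniform delay, reliable detection.
for delay in (10, 100, 1000):
    print(delay, exponent_upper_bound(delay, gamma=1.01, d_max=1.5, p_detect=0.99))
```

As the expected decoding time grows, $\Gamma \to 1$ and the detection probability tends to one, the bound approaches $D_{max}$.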



Appendix 

A. Equivalent definitions of UEP exponents 

We could have defined all the UEP exponents in this paper without using the notion of capacity achieving sequences. As an example, in this section we define the single-bit exponent in this alternate manner and show that both definitions lead to identical results. In this alternative, first $\tilde{E}_b(R)$ is defined as the best exponent for the special bit at a given data-rate $R$ and then it is minimized over all $R < C$ to obtain $\tilde{E}_b$.

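Definition 14: For any $R \ge 0$, $Z(R)$ is the set of sequences of codes, $Q$, with message sets $\mathcal{M}^{(n)}$ such that

$$|\mathcal{M}^{(n)}| \ge e^{Rn} \quad \text{and} \quad \mathcal{M}^{(n)} = \mathcal{M}_1 \times \mathcal{M}_2^{(n)}$$

where $\mathcal{M}_1 = \{0, 1\}$.

Definition 15: For a sequence of codes, $Q$, such that $\lim_{n \to \infty} \Pr^{(n)}[\hat{M} \neq M] = 0$, the single bit exponent $E_{b,Q}$ equals

$$E_{b,Q} = \liminf_{n \to \infty} \frac{-\ln \Pr^{(n)}[\hat{M}_1 \neq M_1]}{n}. \qquad (74)$$

Definition 16: $\tilde{E}_b(R)$ and the single bit exponent $\tilde{E}_b$ are defined as

$$\tilde{E}_b(R) = \sup_{Q \in Z(R)} E_{b,Q}, \qquad \tilde{E}_b = \inf_{R < C} \tilde{E}_b(R).$$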



Note that according to this definition the special bit can achieve the exponent $\tilde{E}_b$ no matter how close the rate is to capacity. We now show why this definition is equivalent to the earlier definition in terms of capacity achieving sequences given in Section III.

Lemma 2: $\tilde{E}_b = E_b$
Proof:
$E_b \le \tilde{E}_b$:

By definition of $E_b$, for any given $\delta > 0$, there exists a capacity-achieving sequence $Q$ such that $E_{b,Q} = E_b$ and, for large enough $n$, $R^{(n)} \ge C - \delta$. If we replace the first $n$ members of $Q$ with codes whose rates are $(C - \delta)$ or higher, we get another sequence $Q'$ such that $Q' \in Z(C - \delta)$ where $E_{b,Q'} = E_b$. Thus $\tilde{E}_b(C - \delta) \ge E_b$ for all $\delta > 0$. Consequently

$$\tilde{E}_b \ge E_b.$$
$E_b \ge \tilde{E}_b$:

Let us first fix an arbitrarily small $\delta > 0$. In the table in Figure 6, row $k$ represents a code-sequence $Q_k \in Z(C - 1/k)$, whose single-bit exponent satisfies

$$E_{b,Q_k} \ge \tilde{E}_b(C - 1/k) - \delta$$

Let $Q_k(l)$ represent the length-$l$ code in this sequence. We construct a capacity achieving sequence $Q$ from this table by sequentially choosing elements of $Q$ from rows $1, 2, \cdots$ as follows.

Fig. 6. Row $k$ denotes a reliable code sequence at rate $C - 1/k$. Bold path shows capacity achieving sequence $Q$.



• For each sequence $Q_i$, let $n_i$ denote the smallest block length $n$ at which,
1) The single bit error probability satisfies

$$\Pr{}^{(n)}[\hat{M}_1 \neq M_1] \le e^{-n(\tilde{E}_b(C - 1/i) - 2\delta)}$$

2) The overall error probability satisfies

$$\Pr{}^{(n)}[\hat{M} \neq M] \le 1/i$$

3) $n_i > n_{i-1}$

• Given the sequence $n_1, n_2, \cdots$, we choose the members of our capacity achieving code from the code-table shown in Figure 6 as follows.

- Initialize: We use the first $n_2 - 1$ members of $Q_1$ as the first $n_2 - 1$ members of the new code.

- Iterate: We choose codes of length $n_i$ to $n_{i+1} - 1$ from the code sequence $Q_i$, i.e.,

$$\big(Q_i(n_i), Q_i(n_i + 1), \cdots, Q_i(n_{i+1} - 1)\big)$$

Thus $Q$ is a sampling of the code-table as shown by the bold path in Figure 6. Note that this choice of $Q$ is a capacity achieving sequence; moreover it will also achieve a single bit exponent

$$E_{b,Q} \ge \inf_{R < C} \{\tilde{E}_b(R) - 2\delta\} = \tilde{E}_b - 2\delta$$

Choosing arbitrarily small $\delta$ proves $E_b \ge \tilde{E}_b$. ∎
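The sampling of the code-table described above amounts to a lookup from the block length $n$ to the row $i$ whose code is used. A minimal sketch, assuming the thresholds $n_1 < n_2 < \cdots$ have already been determined; the function name and the threshold values are hypothetical.

```python
from bisect import bisect_right

def row_for_length(n, thresholds):
    """Row index i (1-based) of the code sequence Q_i that supplies the length-n code.

    thresholds = [n_1, n_2, ...] must be strictly increasing. Lengths below n_2
    use Q_1 (the Initialize step); lengths n_i <= n < n_{i+1} use Q_i (the Iterate step).
    """
    return max(bisect_right(thresholds, n), 1)

thresholds = [5, 12, 30]  # hypothetical values of n_1, n_2, n_3
print([row_for_length(n, thresholds) for n in (1, 5, 11, 12, 29, 30, 100)])
```

Because row $i$ has rate at least $C - 1/i$ and the row index grows without bound along the path, the sampled sequence has rate converging to $C$.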